Updating index gives number of documents in the index cannot exceed 2147483519

NLSVTN · May 20, 2021, 1:33pm

We have an Elasticsearch index with 42522080 documents. We are trying to update it by retrieving the VCF file from which the index was generated (from the ES index metadata), annotating it, and then writing it back.

The code is presented below:

mt = hl.import_vcf(dataset_path, reference_genome='GRCh' + genome_version, force_bgz=True, min_partitions=500)
mt = hl.split_multi_hts(mt.annotate_rows(locus_old=mt.locus, alleles_old=mt.alleles), permit_shuffle=True)

...

variant_count = mt.count_rows()
logger.info("\n==> exporting {} variants to elasticsearch:".format(variant_count))

row_table = mt.rows().flatten()
row_table = row_table.drop(row_table.locus, row_table.alleles)

hl.export_elasticsearch(row_table, ...)

We verified that variant_count is 42522080. However, when the export_elasticsearch function runs, it produces this error:

hail.utils.java.FatalError: EsHadoopException: Could not write all entries for bulk operation [93/1000]. Error sample (first [5] error messages):
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
Bailing out…

So, while writing to Elasticsearch, the number of documents generated and written seems to exceed Lucene’s capacity, which is strange given that the total number of variants is 42522080.

So, I am wondering whether it’s an issue with Hail’s split_multi_hts function, and which parameters or settings I could try in Hail to avoid it. I might also ask a similar question on the Elasticsearch forum, since I am not fully sure that it belongs here.

nawatts · May 20, 2021, 2:50pm

Are you setting es.mapping.id in the config passed to hl.export_elasticsearch? If not, duplicate documents may be written to the ES index (for example, if a task running on a preemptible machine is preempted midway through processing a partition).
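
For illustration, a minimal sketch of what such a config might look like. The field name `variantId` is hypothetical (not taken from this thread) and should be replaced with whichever row field uniquely identifies a document; the keys are elasticsearch-hadoop settings.

```python
# Hypothetical settings: 'variantId' is an assumed field name.
es_config = {
    # Use this row field as the Elasticsearch document _id, so a retried
    # task overwrites its earlier writes instead of adding duplicates.
    "es.mapping.id": "variantId",
    # "index" adds new documents and replaces existing ones with the same _id.
    "es.write.operation": "index",
}

# Passed along with the other arguments, e.g.:
# hl.export_elasticsearch(row_table, ..., config=es_config)
```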

Also, does the mapping for this index include any nested fields? Values for nested fields are stored as separate documents. Thus, when using nested fields, more than one ES document may be created for a row in a Hail Table.

NLSVTN · May 21, 2021, 2:26pm

Yeah, I am using es.mapping.id, so it’s most probably the nested fields. Is there a way to calculate the total number of documents that will be generated for a MatrixTable, so that I could set the number of primary shards accordingly?

nawatts · May 21, 2021, 2:41pm

It would depend on the mapping used for your ES index. Generally, nested fields are useful for arrays of objects. Is there a field in your table that contains a collection of structs and is mapped as nested?

NLSVTN · May 21, 2021, 3:46pm

I see 3 nested fields there: one has 7 array fields, another has 26, and the third has 5. Would the number then be (7 + 26 + 5 + 1) * 42522080? But that is still less than the 2147483519 limit. There is no nesting within nested fields, just 3 nested fields at the same level. Or will each item in each array be its own document?

nawatts · May 21, 2021, 4:05pm

Each element in the array/set/collection would be a separate document for fields mapped as nested.

So if a Table row has a field containing an array of length 3, there are 4 documents created for that row.
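
As a rough sketch of that counting rule in plain Python: the helper names and the sample row below are made up for illustration, and the shard estimate assumes the limit applies to each shard separately (each shard being its own Lucene index) and that documents are spread evenly across shards, which may not hold in practice.

```python
LUCENE_MAX_DOCS = 2_147_483_519  # limit quoted in the error message


def docs_for_row(row, nested_fields):
    """One document for the row itself plus one per element of each nested array."""
    return 1 + sum(len(row.get(field) or []) for field in nested_fields)


def min_primary_shards(total_docs, limit=LUCENE_MAX_DOCS):
    """Lower bound on primary shards, assuming documents are evenly distributed."""
    return max(1, -(-total_docs // limit))  # integer ceiling division


# Hypothetical row: one nested field with 3 elements, one missing nested field.
row = {"variantId": "1-100-A-T", "transcripts": [{}, {}, {}], "genotypes": None}
print(docs_for_row(row, ["transcripts", "genotypes"]))  # 4
```

Summing `docs_for_row` over all rows gives the total to pass to `min_primary_shards`. On a real Hail Table the same sum would have to be computed with an aggregation over the lengths of the nested array fields rather than in a Python loop.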
