Updating an index fails with "number of documents in the index cannot exceed 2147483519"

We have an Elasticsearch index containing 42522080 documents. We are trying to update it by retrieving the VCF file from which the index was generated (taken from the ES index metadata), annotating it, and then writing it back.

The code is presented below:

mt = hl.import_vcf(dataset_path, reference_genome='GRCh' + genome_version, force_bgz=True, min_partitions=500)
mt = hl.split_multi_hts(mt.annotate_rows(locus_old=mt.locus, alleles_old=mt.alleles), permit_shuffle=True)

...

variant_count = mt.count_rows()
logger.info("\n==> exporting {} variants to elasticsearch:".format(variant_count))

row_table = mt.rows().flatten()
row_table = row_table.drop(row_table.locus, row_table.alleles)

hl.export_elasticsearch(row_table, ...)

We verified that variant_count is 42522080. However, when the export_elasticsearch function runs, the following error is produced:

hail.utils.java.FatalError: EsHadoopException: Could not write all entries for bulk operation [93/1000]. Error sample (first [5] error messages):
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
Bailing out…

So it seems that, while writing to Elasticsearch, the number of documents generated and written exceeds Lucene’s capacity, which is strange given that the total number of variants is 42522080.

So I am wondering whether this is an issue with Hail’s split_multi_hts function, and which parameters or settings I could try in Hail to avoid it. I might also ask a similar question on the Elasticsearch forum, since I am not fully sure it belongs here.

Are you setting es.mapping.id in the config passed to hl.export_elasticsearch? If not, duplicate documents may be written to the ES index (for example, if a task running on a preemptible machine is preempted midway through processing a partition).
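
For illustration, a minimal sketch of what such a config might look like. The field name `variant_id` is hypothetical; it stands for whichever field in your table uniquely identifies a row.

```python
# Hypothetical field name: the row field whose value uniquely identifies a variant.
ID_FIELD = "variant_id"

# elasticsearch-hadoop setting: use this field's value as the document _id,
# so a retried write overwrites the earlier document instead of adding a new one.
es_config = {"es.mapping.id": ID_FIELD}

# The dict would then be passed as the config argument, e.g.
# hl.export_elasticsearch(row_table, ..., config=es_config)
```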

Also, does the mapping for this index include any nested fields? Values for nested fields are stored as separate documents. Thus, when using nested fields, more than one ES document may be created for a row in a Hail Table.

Yeah, I am using es.mapping.id, so it's most probably the nested fields. Is there a way to calculate the total number of documents that will be generated for a MatrixTable, so that I could set the number of primary shards accordingly?

It would depend on the mapping used for your ES index. Generally, nested fields are useful for arrays of objects. Is there a field in your table that contains a collection of structs and is mapped as nested?

I see three nested fields there: one has 7 array fields, another has 26, and the third has 5. Would the number then be (7 + 26 + 5 + 1) * 42522080? But that's still less than the 2147483519 limit. There are no nested fields within nested fields, just three nested fields at the same level. Or will each item in each array be its own document?

Each element in the array/set/collection would be a separate document for fields mapped as nested.

So if a Table row has a field containing an array of length 3, there are 4 documents created for that row.
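
As a rough sketch of that arithmetic in plain Python (not Hail code): each row contributes one parent document plus one document per element of each nested array. The sample rows and the average of 60 documents per row are made-up numbers. The shard estimate assumes that the limit in the error message applies per shard (each shard being its own Lucene index) and that documents are spread evenly across shards.

```python
import math

# Limit quoted in the error message; assumed here to apply to each shard separately.
LUCENE_MAX_DOCS = 2_147_483_519


def lucene_docs_for_row(nested_lengths):
    """One parent document plus one document per element of each nested array."""
    return 1 + sum(nested_lengths)


def min_primary_shards(total_docs, max_docs_per_shard=LUCENE_MAX_DOCS):
    """Smallest shard count that keeps every shard under the limit,
    assuming documents are distributed evenly across shards."""
    return max(1, math.ceil(total_docs / max_docs_per_shard))


# Hypothetical rows: each tuple holds the lengths of three nested array fields.
sample_rows = [(3, 0, 1), (7, 26, 5), (0, 0, 0)]
sample_total = sum(lucene_docs_for_row(r) for r in sample_rows)

# Scaling up: the row count from this thread times a made-up average per row.
n_rows = 42_522_080
assumed_avg_docs_per_row = 60  # hypothetical; measure this on the real table
estimated_total = n_rows * assumed_avg_docs_per_row

print(sample_total, estimated_total, min_primary_shards(estimated_total))
```

On the real table, the per-row lengths would come from the fields that are mapped as nested, for example by aggregating the lengths of those array fields.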
