We have an Elasticsearch index that contains 42,522,080 documents. We are trying to update it by retrieving the VCF file from which the index was generated (its path is stored in the ES index metadata), annotating it, and then writing it back.
The code is presented below:
import hail as hl

mt = hl.import_vcf(dataset_path, reference_genome='GRCh' + genome_version, force_bgz=True, min_partitions=500)
mt = hl.split_multi_hts(mt.annotate_rows(locus_old=mt.locus, alleles_old=mt.alleles), permit_shuffle=True)
...
variant_count = mt.count_rows()
logger.info("\n==> exporting {} variants to elasticsearch:".format(variant_count))
row_table = mt.rows().flatten()
row_table = row_table.drop(row_table.locus, row_table.alleles)
hl.export_elasticsearch(row_table, ...)
We verified that variant_count is 42,522,080; however, when export_elasticsearch is run, the following error is produced:
hail.utils.java.FatalError: EsHadoopException: Could not write all entries for bulk operation [93/1000]. Error sample (first [5] error messages):
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
number of documents in the index cannot exceed 2147483519
Bailing out…
So, while writing into Elasticsearch, the number of documents generated and written seems to exceed Lucene's document limit (2,147,483,519), which is strange given that the total number of variants is only 42,522,080.
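To double-check what is already in the target index, a minimal diagnostic could be run against the cluster with the Python Elasticsearch client (es_host, es_port and index_name here are hypothetical placeholders, not values taken from the pipeline above):

from elasticsearch import Elasticsearch

# Hypothetical check: how many documents does the target index already hold,
# and how are they distributed across its shards?
es = Elasticsearch(f"http://{es_host}:{es_port}")
print(es.count(index=index_name))                 # total document count in the index
print(es.cat.shards(index=index_name, v=True))    # per-shard document counts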
So, I am wondering whether this is an issue with Hail's split_multi_hts function, and what I could try in Hail (which parameters or settings) to avoid it. I might also ask a similar question on the Elasticsearch forum, since I am not fully sure this is the right place for it.
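For reference, a minimal sketch of the split_multi_hts check I have in mind (assuming the same dataset_path and genome_version as in the snippet above) is to compare the row counts before and after the split:

import hail as hl

# Count rows before and after splitting multi-allelic sites; the difference is
# the number of extra rows introduced by split_multi_hts itself.
mt_raw = hl.import_vcf(dataset_path, reference_genome='GRCh' + genome_version,
                       force_bgz=True, min_partitions=500)
mt_split = hl.split_multi_hts(
    mt_raw.annotate_rows(locus_old=mt_raw.locus, alleles_old=mt_raw.alleles),
    permit_shuffle=True)
print(mt_raw.count_rows(), mt_split.count_rows())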