Export_elasticsearch function documentation for updating behavior

NLSVTN · January 26, 2021, 6:01pm

Hi,

I am wondering whether there is any detailed discussion or documentation of export_elasticsearch besides the source code.

https://hail.is/docs/0.2/methods/impex.html#hail.methods.export_elasticsearch

What I am interested in is how it updates Elasticsearch:

Does it delete and recreate the index?
If not, and the documents being written lack some fields that are present in the existing documents in the index, will the new documents overwrite the old ones (deleting those fields)? If not, that would mean some outdated data is left behind in the index.

So I am wondering what the update mechanism is, and when I should avoid using the function because of a risk of ending up with inconsistent data in the index.

nawatts · January 26, 2021, 6:59pm

The default configuration can be found in the source code.

It does not delete and recreate the index. By default, it will create the index if it does not exist (based on the es.index.auto.create setting).

With the default configuration, nothing should be overwritten. The es.write.operation setting defaults to index, which means that old documents are replaced by new documents with the same ID. However, the es.mapping.id setting defaults to none, which means new documents get auto-generated IDs and so never match an existing document.

Setting es.mapping.id is essential for avoiding inconsistency. If an export task for some partition of the Hail table fails (for example, if a task is running on a preemptible Dataproc worker that gets removed from the cluster) then Hail will retry the task. This means that some table rows may be indexed more than once. If es.mapping.id is set, copies will be overwritten. If es.mapping.id is not set, the index may end up with duplicate documents.
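
As a rough illustration of the settings discussed above, here is a minimal sketch of passing them through the config argument of export_elasticsearch. The helper names, the `variant_id` field, the `_doc` index type, and the block size are all assumptions made for the example; the ID field is assumed to be a top-level field of the table that uniquely identifies each row.

```python
def build_es_config(id_field):
    """Build elasticsearch-hadoop settings that key documents on `id_field`.

    Values are strings because the settings are passed through as text.
    """
    return {
        # Use this document field as the Elasticsearch _id, so a retried
        # task overwrites its earlier copies instead of duplicating them.
        "es.mapping.id": id_field,
        # Default write operation: replace any document with the same ID.
        "es.write.operation": "index",
        # Default: create the index if it does not exist yet.
        "es.index.auto.create": "true",
    }


def export_with_stable_ids(table, host, port, index, id_field="variant_id"):
    """Hypothetical wrapper; `variant_id` is an assumed field name."""
    import hail as hl  # imported here so the config helper works without Hail

    hl.export_elasticsearch(
        table,
        host=host,
        port=port,
        index=index,
        index_type="_doc",  # assumption; depends on the Elasticsearch version
        block_size=1000,    # assumption; tune for your cluster
        config=build_es_config(id_field),
    )
```

The last two entries only restate the defaults described above; es.mapping.id is the setting that changes behavior.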

NLSVTN · February 22, 2021, 5:57pm

Just to verify: the statement ‘existing data (based on its id) is replaced (reindexed).’ in the description of the ‘es.write.operation’ setting means that if I remove a field (in the VCF or in a pipeline that does annotation) that would otherwise end up in a document, it will no longer be there, even if the document already exists? So, from my understanding, it is pretty much a full replacement of all of the documents, right? Does it check at all whether the documents to be written are identical to the existing ones?

nawatts · February 23, 2021, 1:09pm

Setting es.write.operation to index will cause any existing documents with the same ID to be replaced.
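
To make the "replaced" semantics concrete, here is a toy model in plain Python. It is not Elasticsearch or Hail code; the dictionary `store`, the function names, and the sample IDs and fields are all made up for the illustration. It treats the index as a mapping from ID to document, where an index operation swaps in the whole new document.

```python
import uuid


def index_document(store, doc, doc_id=None):
    """Toy model of the `index` write operation.

    With an explicit ID, the stored document is replaced wholesale.
    Without one, a fresh ID is generated, so nothing is ever replaced.
    """
    if doc_id is None:
        doc_id = uuid.uuid4().hex  # stands in for an auto-generated ID
    store[doc_id] = dict(doc)
    return doc_id


# With a stable ID: the second write drops the field that was removed.
store = {}
index_document(store, {"gene": "BRCA1", "af": 0.01}, doc_id="1-12345-A-T")
index_document(store, {"gene": "BRCA1"}, doc_id="1-12345-A-T")
with_ids = dict(store)

# Without an ID: writing the same row twice (as a retried task would)
# leaves two copies in the store.
store = {}
index_document(store, {"gene": "BRCA1"})
index_document(store, {"gene": "BRCA1"})
without_ids = dict(store)
```

In this model `with_ids` holds a single document with no `af` field, while `without_ids` holds two identical documents under different IDs.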
