Export_elasticsearch function documentation for updating behavior

Hi,

I am wondering whether there is any detailed discussion or documentation for export_elasticsearch besides the source code.

https://hail.is/docs/0.2/methods/impex.html#hail.methods.export_elasticsearch

What I am interested in is how it updates Elasticsearch:

  1. Does it delete and recreate the index?
  2. If not, and some fields are absent from the documents being written (compared with the documents already in the index), will the write remove those existing fields? If not, outdated data would be left behind in the index.

So, I am wondering what the mechanism of the update is, and when I should avoid using the function because it could leave inconsistent data in the index.

The default configuration can be found in the source code.

It does not delete and recreate the index. By default, it will create the index if it does not exist (based on the es.index.auto.create setting).

With the default configuration, nothing should be overwritten. The es.write.operation setting defaults to index, which means that old documents are replaced by new documents with the same ID. However, the es.mapping.id setting defaults to none, which means new documents will get auto generated IDs.
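To make the interaction between the two settings concrete, here is a toy model in plain Python. It is not Elasticsearch or Hail code: the `write` helper, the dict-based "index", and the `variant` field are all made up for illustration. It only mimics the rule described above, where a write replaces the document stored under the same ID and a missing ID mapping means every write gets a fresh ID.

```python
import uuid

def write(index, row, id_field=None):
    """Toy 'index' write: `index` is a dict mapping document ID -> document."""
    # No ID mapping: generate a fresh ID, so nothing existing is ever replaced.
    doc_id = row[id_field] if id_field else uuid.uuid4().hex
    index[doc_id] = dict(row)
    return doc_id

row = {"variant": "1:12345:A:T", "gene": "ABC1"}

# Default-like behaviour: auto-generated IDs.
auto_ids = {}
write(auto_ids, row)
write(auto_ids, row)  # same row again becomes a second document

# ID taken from a field of the row.
mapped_ids = {}
write(mapped_ids, row, id_field="variant")
write(mapped_ids, row, id_field="variant")  # replaces the first copy

print(len(auto_ids), len(mapped_ids))  # 2 1
```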

Setting es.mapping.id is essential for avoiding inconsistency. If an export task for some partition of the Hail table fails (for example, if a task is running on a preemptible Dataproc worker that gets removed from the cluster), then Hail will retry the task. This means that some table rows may be indexed more than once. If es.mapping.id is set, the copies will be overwritten. If es.mapping.id is not set, the index may end up with duplicate documents.
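A minimal sketch of how this might look when calling the function, assuming the table has a field holding a unique string per row. The field name `doc_id`, the helper names, and the `index_type` and `block_size` values are placeholders, not recommendations; check the argument list against the export_elasticsearch documentation for your Hail version.

```python
def build_es_config(id_field):
    """Settings passed through to the Elasticsearch connector (string values)."""
    return {
        # Use this table field as the document ID so retried rows overwrite
        # their earlier copies instead of creating duplicates.
        "es.mapping.id": id_field,
        # Already the default; spelled out here for clarity.
        "es.write.operation": "index",
    }

def export_with_stable_ids(table, host, port, index, id_field="doc_id"):
    """Hypothetical wrapper; `table` must contain a unique `id_field`."""
    import hail as hl  # imported lazily so the config helper works without Hail

    hl.export_elasticsearch(
        table,
        host=host,
        port=port,
        index=index,
        index_type="_doc",   # placeholder value
        block_size=1000,     # placeholder value
        config=build_es_config(id_field),
    )
```

The ID field has to be stable across runs (for example, derived from the row key) for re-exports to line up with the documents already in the index.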

Just to verify: does the statement "existing data (based on its id) is replaced (reindexed)" in the es.write.operation setting mean that if I delete a field (in the VCF or in an annotation pipeline) that would otherwise end up in a document, it will no longer be there, even if the document already exists? So, as I understand it, it is pretty much a full replacement of all of the documents, right? Does it check at all whether the documents to be written are identical to the existing ones?

Setting es.write.operation to index will cause any existing documents with the same ID to be replaced.
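As an illustration of what "replaced" means for a dropped field, here is a toy comparison in plain Python between replacing a whole document and merging into it. The dict-based store, the helper names, and the `old_score` field are invented for this sketch. The merge variant is included only for contrast; it is meant to resemble the update-style write operations described in the elasticsearch-hadoop configuration docs, which is an assumption to verify there.

```python
def index_doc(store, doc_id, doc):
    """Replace semantics: the stored document becomes exactly `doc`."""
    store[doc_id] = dict(doc)

def update_doc(store, doc_id, partial):
    """Merge semantics: fields in `partial` are set, other fields are kept."""
    store.setdefault(doc_id, {}).update(partial)

old = {"variant": "1:12345:A:T", "gene": "ABC1", "old_score": 0.3}
new = {"variant": "1:12345:A:T", "gene": "ABC1"}  # old_score was dropped upstream

replaced = {"v1": dict(old)}
index_doc(replaced, "v1", new)

merged = {"v1": dict(old)}
update_doc(merged, "v1", new)

print("old_score" in replaced["v1"])  # False: the field is gone after replace
print("old_score" in merged["v1"])    # True: a merge would keep the stale field
```

Note that this only applies to documents that are written again under the same ID. Documents whose rows no longer exist in the table are not touched by the export.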
