Adding samples incrementally

Steven-N-Hart · November 20, 2016, 11:19pm

The issue is that all VCFs to be imported have to have the exact same samples in them. This means you have to have all the samples you want to work on inside a single VCF file. Why not just add them incrementally? In other words, if I have 3000 samples today, and I get 3000 more samples next week, I have to have duplicate files. Any plans on adjusting the VDS to be more like a true data store that you can grow with your data?

tpoterba · November 21, 2016, 11:10pm

Incremental ingest is an incredibly useful model, but it’s one we don’t currently fully support due to the nature of the datasets on hand in the current climate. The outer join on samples and variants is exactly what we want to do here, but it’s complicated by the fact that VCF is a lossy data format – you lose the metadata at monomorphic sites. It would be pretty easy to write an outer join that fills in reference genotypes at all sites not present in a VCF, but this doesn’t seem ideal.
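To make the concern concrete, here is a minimal sketch of such an outer join in plain Python. It is a toy model, not Hail's API: the function name, the dict-of-dicts layout, and the variant and sample identifiers are all made up for illustration. The key assumption, flagged in the code, is that a site missing from a VCF is homozygous reference.

```python
# Toy model: each "VCF" maps variant -> {sample: genotype}.
# Variant keys, sample names, and genotype strings are hypothetical.
HOM_REF = "0/0"


def outer_join_fill_ref(batch_a, batch_b):
    """Outer-join two batches on variants, filling absent calls with hom-ref."""
    samples_a = sorted({s for calls in batch_a.values() for s in calls})
    samples_b = sorted({s for calls in batch_b.values() for s in calls})
    merged = {}
    for variant in sorted(set(batch_a) | set(batch_b)):
        row = {}
        for samples, batch in ((samples_a, batch_a), (samples_b, batch_b)):
            calls = batch.get(variant, {})
            for sample in samples:
                # Assumption: a site absent from a VCF is hom-ref.
                # In reality it could be a no-call or a low-coverage site;
                # the VCF no longer carries that information.
                row[sample] = calls.get(sample, HOM_REF)
        merged[variant] = row
    return merged


batch_a = {"1:1000:A:T": {"s1": "0/1", "s2": "0/0"}}
batch_b = {"1:1000:A:T": {"s3": "1/1"}, "1:2000:G:C": {"s3": "0/1"}}
merged = outer_join_fill_ref(batch_a, batch_b)
```

In this example, s1 and s2 come out as "0/0" at 1:2000:G:C even though the first batch says nothing about that site. The join cannot tell a true reference call from missing data, which is the part that is not ideal.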

A better solution is to ingest a format like GVCF, which supports lossless variant expansion. We haven’t done this yet because the GATK calling pipeline includes a joint calling step, and the analysts at Broad consider a joint called VCF as the starting point for any downstream analyses.

There are other open-source tools like TileDB in the works which may solve this problem better than we can, given our other priorities. If you have ideas in this domain, we are happy to hear them and discuss further!

cseed · November 26, 2016, 9:25pm

Hi Steven,

Thanks for taking an interest in Hail! We appreciate the feedback. I saw your VariantDB challenge a while ago, but we aren’t in a position to submit a solution yet since our genotype schema is currently fixed and doesn’t match your sample file.

In response to your question, I have a few additional remarks to add to Tim’s excellent answer:

Hail is intended to be agnostic to the source of the data. This is similar to the design of Spark, which supports connectors to a wide variety of data sources (SQL databases through JDBC, Parquet, various text or binary files in Hadoop, object storage or local files, Cassandra, Solr, ElasticSearch, MongoDB, etc.)

Admittedly, we haven’t taken great advantage of this yet. Our focus so far has been on the VDS format which stores data in Parquet and optimizes our main use case: queries involving full dataset scans (like computing per-variant or per-sample metrics, association analyses, etc.) and infrequent data drops.

We currently support the following data sources:

The VDS format through read and write,
various file formats (TSV, VCF, (B)GEN, BED, etc.) stored in file systems supported by Hadoop,
Tom White has contributed a preliminary connector to Apache Kudu, a new database by Cloudera: http://kudu.apache.org/. It can be accessed through readkudu and writekudu. Kudu only hit 1.0 recently. The Kudu connector has some limitations (biallelics only, for example) and isn’t integrated into our continuous testing infrastructure. That said, it does support incremental sample addition.
Cassandra and Solr (write only) through the exportvariantscass and exportvariantssolr commands.

We also have someone who is planning to add support for JDBC and MongoDB. Depending on how the data is represented in the underlying databases, these could naturally support incrementally adding samples.

Loading multiple VCFs with importvcf does require the samples to be the same (so-called vertical concatenation). We also have a join command that does horizontal concatenation, although it currently does an inner join on variants. We plan to support other join types in the near future.
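The difference between the two concatenation modes can be sketched in plain Python. This is a toy model under stated assumptions, not the implementation of importvcf or join: the function names and the variant-to-calls dict layout are hypothetical.

```python
# Toy model: a dataset maps variant -> {sample: genotype}.
# Function names and identifiers are hypothetical.


def sample_set(dataset):
    """Collect every sample name that appears in the dataset."""
    return {s for calls in dataset.values() for s in calls}


def concat_vertical(a, b):
    """Same samples, different variants: stack the variant rows."""
    if sample_set(a) != sample_set(b):
        raise ValueError("vertical concatenation requires identical samples")
    return {**a, **b}


def concat_horizontal_inner(a, b):
    """Different samples: keep only variants present in both inputs."""
    return {v: {**a[v], **b[v]} for v in a.keys() & b.keys()}


chr1 = {"1:1000:A:T": {"s1": "0/1", "s2": "0/0"}}
chr2 = {"2:500:G:C": {"s1": "1/1", "s2": "0/1"}}
new_samples = {
    "1:1000:A:T": {"s3": "0/0"},
    "1:3000:C:G": {"s3": "0/1"},
}

stacked = concat_vertical(chr1, chr2)
joined = concat_horizontal_inner(chr1, new_samples)
```

Here `stacked` holds both variants for s1 and s2, while `joined` holds s1, s2, and s3 at 1:1000:A:T only. The variant 1:3000:C:G is dropped because the inner join keeps only sites found in both inputs.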

In other words, if I have 3000 samples today, and I get 3000 more samples next week, I have to have duplicate files.

I didn’t follow. Does join handle this use case? If getting new data is infrequent, we don’t see occasional reimport from VCF as particularly onerous.

Steven-N-Hart · November 28, 2016, 1:19pm

Thank you for your valuable insights. Sounds like I need to spend some time digging into the code and contributing some more drivers for big data formats.

rajwanir2 · February 27, 2025, 5:02pm

Hello,

Sorry for posting on this thread nine years after the original post.

Just wanted to check whether there have been any improvements to incremental addition of new samples. Similar to the original post, say we have a dataset of 100,000 samples × 700,000 variants and would like to add 1,000 samples at a regular frequency (weekly or monthly). Would joins (see the Hail Table Joins Tutorial) be efficient enough for such operations, i.e. using 1-4 CPUs and <10 GB of memory, and finishing in a few hours at most?

My dataset is an array VCF, so the variants remain constant across tables. Only new samples need to be added.

I would much appreciate any suggestions or advice on assessing the suitability of Hail for this task.

Thank you.
