The issue is that all VCFs to be imported have to have the exact same samples in them. This means you have to have all the samples you want to work on inside a single VCF file. Why not just add them incrementally? In other words, if I have 3000 samples today, and I get 3000 more samples next week, I have to have duplicate files. Any plans on adjusting the VDS to be more like a true data store that you can grow with your data?
Incremental ingest is an incredibly useful model, but it’s one we don’t currently fully support due to the nature of the datasets on hand in the current climate. The outer join on samples and variants is exactly what we want to do here, but it’s complicated by the fact that VCF is a lossy data format – you lose the metadata at monomorphic sites. It would be pretty easy to write an outer join that fills in reference genotypes at all sites not present in a VCF, but this doesn’t seem ideal.
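To make the "easy but not ideal" option concrete, here is a minimal Spark sketch (not Hail code; the file paths and column names like `genotypes1`/`genotypes2` are invented for illustration) of an outer join that back-fills reference genotypes:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

object NaiveOuterJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("naive-outer-join").getOrCreate()

    // Hypothetical layout for illustration: one row per variant, with each
    // batch's genotypes packed into a single string column. The real VDS
    // representation is much richer than this.
    val batch1 = spark.read.parquet("batch1.parquet") // columns: variant, genotypes1
    val batch2 = spark.read.parquet("batch2.parquet") // columns: variant, genotypes2

    // Full outer join on variants: a variant seen in only one batch gets
    // nulls for the other batch's genotypes ...
    val joined = batch1.join(batch2, Seq("variant"), "full_outer")

    // ... which we would then back-fill with homozygous-reference calls.
    // This is the lossy step: the VCF no longer tells us whether those calls
    // were confidently reference or simply never assessed (no depth, GQ,
    // etc. at monomorphic sites).
    val filled = joined
      .withColumn("genotypes1", coalesce(col("genotypes1"), lit("0/0")))
      .withColumn("genotypes2", coalesce(col("genotypes2"), lit("0/0")))

    filled.write.parquet("combined.parquet")
    spark.stop()
  }
}
```

The `coalesce` with a hom-ref literal is exactly where the information loss happens: nothing distinguishes a confident reference call from a site that was never covered.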
A better solution is to ingest a format like GVCF, which supports lossless variant expansion. We haven’t done this yet because the GATK calling pipeline includes a joint calling step, and the analysts at Broad consider a joint-called VCF the starting point for any downstream analyses.
There are other open-source tools like TileDB in the works which may solve this problem better than we can, given our other priorities. If you have ideas in this domain, we are happy to hear them and discuss further!
Hi Steven,
Thanks for taking an interest in Hail! We appreciate the feedback. I saw your VariantDB challenge a while ago, but we aren’t in a position to submit a solution yet since our genotype schema is currently fixed and doesn’t match your sample file.
In response to your question, I have a few remarks to add to Tim’s excellent answer:
Hail is intended to be agnostic to the source of the data. This is similar to the design of Spark, which supports connectors to a wide variety of data sources (SQL databases through JDBC, Parquet, various text or binary files in Hadoop, object storage or local files, Cassandra, Solr, ElasticSearch, MongoDB, etc.).
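As a generic illustration of that pluggability (plain Spark, not Hail; the URLs, table names, and credentials are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object SparkDataSources {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("spark-data-sources").getOrCreate()

    // Columnar files on any Hadoop-compatible file system (HDFS, S3, local, ...).
    val fromParquet = spark.read.parquet("hdfs:///data/variants.parquet")

    // A SQL database over JDBC; the driver jar must be on the classpath.
    val fromJdbc = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/genomics")
      .option("dbtable", "variants")
      .option("user", "reader")
      .option("password", "secret")
      .load()

    // Third-party connectors (Cassandra, MongoDB, ElasticSearch, ...) plug in
    // the same way via .format("<connector>") plus the connector package.

    println(s"parquet rows: ${fromParquet.count()}, jdbc rows: ${fromJdbc.count()}")
    spark.stop()
  }
}
```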
Admittedly, we haven’t taken great advantage of this yet. Our focus so far has been on the VDS format which stores data in Parquet and optimizes our main use case: queries involving full dataset scans (like computing per-variant or per-sample metrics, association analyses, etc.) and infrequent data drops.
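To make "full dataset scan" concrete, here is a hedged sketch of the kind of single-pass, per-variant query that layout favors (the flat `variant`/`sample`/`call` schema is invented for illustration and is not the actual VDS schema):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit}

object FullScanMetrics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("full-scan-metrics").getOrCreate()

    // Hypothetical flat layout: one row per (variant, sample) genotype call,
    // with `call` null when the genotype is missing.
    val gt = spark.read.parquet("genotypes.parquet")

    // Per-variant call rate, computed in a single scan over the whole table.
    val callRate = gt
      .groupBy("variant")
      .agg((count(col("call")) / count(lit(1))).as("call_rate"))

    callRate.show(10)
    spark.stop()
  }
}
```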
We currently support the following data sources:
- The VDS format, through `read` and `write`.
- Various file formats (TSV, VCF, (B)GEN, BED, etc.) stored in file systems supported by Hadoop.
- A preliminary connector to Apache Kudu, a new database by Cloudera (http://kudu.apache.org/), contributed by Tom White. It can be accessed through `readkudu` and `writekudu`. Kudu only hit 1.0 recently. The Kudu connector has some limitations (biallelics only, for example) and isn’t integrated into our continuous testing infrastructure. That said, it does support incremental sample addition.
- Cassandra and Solr (write only), through the `exportvariantssolr` and `exportvariantscass` commands.
We also have someone who is planning to add support for JDBC and MongoDB. Depending on how the data is represented in the underlying databases, these could naturally support incrementally adding samples.
Loading multiple VCFs with `importvcf` does require the samples to be the same (so-called vertical concatenation). We also have a `join` command that does horizontal concatenation, although it currently does an inner join on variants. We plan to support other join types in the near future.
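In plain Spark terms, the current behavior is roughly an inner join on the variant key (a generic sketch with made-up column names, not the actual `join` implementation):

```scala
import org.apache.spark.sql.SparkSession

object HorizontalConcat {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("horizontal-concat").getOrCreate()

    // Two cohorts with disjoint samples but overlapping variants.
    val cohortA = spark.read.parquet("cohortA.parquet") // columns: variant, genotypesA
    val cohortB = spark.read.parquet("cohortB.parquet") // columns: variant, genotypesB

    // Inner join on variants: the sample columns are concatenated side by
    // side, but any variant private to one cohort is dropped. Supporting
    // other join types would keep those variants instead.
    val merged = cohortA.join(cohortB, Seq("variant"), "inner")

    merged.write.parquet("merged.parquet")
    spark.stop()
  }
}
```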
In other words, if I have 3000 samples today, and I get 3000 more samples next week, I have to have duplicate files.
I didn’t follow. Does `join` handle this use case? If getting new data is infrequent, we don’t see occasional reimport from VCF as particularly onerous.
Thank you for your valuable insights. Sounds like I need to spend some time digging into the code and contributing some more drivers for big data formats.