Hello Hail team,
I’m trying to figure out is hail a good option for my tasks and come to the point where I need some advice.
In my settings I have set of tens of thousands vcf (also have a gvcf but that is a different story) files.
I’d like to integrate them together in a sort of a DB for being able to make a simple queries like:
- give me all sample IDs with the given rsID (of variants in a given locus)
- give me all variants in a given gene/locus
- calculate frequency over subset of samples
and other of that sort.
(which seems as a completely different task) add to that integration (“db”) new samples (batches) time to time.
I’m working on a local server which has SSD and 512GB of RAM
I’ve tested two approaches - with plink and hail (and was also suggested to recruit some programmers and just load everything into oracle, but have not investigate that properly so far).
So far I’m able to assemble plink and hail per chromosome files in couple of days for a set of 10k samples.
PS Hail matrixes I eventually had to constructed by importing plink files, as far as I failed to find a correct way of importing and integrating vcf (not a gvcf) files.
As far as I see it for now hail does not particularly well suited for the task of querying (using it as a db) because (as far as I understand) it lack the random access to the data due to distributed architecture. And plink seems to be a faster option here.
But my concerns are: 1) that I’ll hit the limits of plink capability at some point 2) that I’ll need to perform some proper data wrangling (GWAS, PRS, etc) and hail might be more suitable.
Concerning the second task (addition of the a single sample to “db”) - is it possible to perform such operation efficiently in hail (without reading/writing the whole dataset, no joint calling just a simple addition of vcf file)?
Should I consider an option of proper DB creation?
PS my I apology if these questions are a bit naïve
Is this genotype chip data or sequencing data?
Hail supports fast row filtering if you use the row key. Hail MatrixTables are traditionally keyed by locus (and alleles). Any of the following operation should read O(subset) amount of data:
filter_rows(mt.locus == ...)
Therefore, for Hail to rapidly subset on the variant access you need to separately maintain:
- A map from rsID to locus & alleles.
- A map from gene name to locus-interval.
Hail lacks fast subsetting of samples.
Hail Table and Matrix Table files are not mutable. Adding a sample requires reading and writing. You can join two matrix tables or two tables on the fly:
mt1 = hl.read_matrix_table(...)
mt2 = hl.read_matrix_table(...)
mt_union = mt1.union_cols(mt2)
You can keep each batch as a separate matrix table until you find the cost of joining outweighs the cost of a read-write.
And, all that said, PLINK is very fast. If you’re not doing serious computation you might find PLINK reasonable. Honestly, if you’re just trying to return lists of GTs rapidly, you might be better off with a traditional database or sorted-key-value store.
thank you for sharing your thoughts and thank you for developing Hail.
Seems that I need to define the usage cases in more details before making a choice.
PS - these are sequencing data (WGS), sorry, I thought that I wrote that in the initial post. It probably would not be a problem if I would work with arrays…
You’re probably already aware, but, for future readers: when using sequencing data, unless all your VCF files have the same set of variants,
union_cols won’t work. You need to start from the GVCF files so that you can confidently distinguish between “no call” / NA / missing data and homozygous reference calls. You could use union_cols if you restrict to the set of variants common across all files.
you might consider a proper genome variant store for these use cases.
E.g. see https://dnaerys.org