Using Hail features at scale in the cloud


At the moment I am storing every sample as a gVCF and also in Parquet format on AWS.
I am able to run queries on AWS using Athena, which is great. However, Athena was not built for genetics queries, and it lacks features that Hail already has built in (for example, identity by descent).

Now I am wondering if it would be better to switch from Parquet directly to Hail's native MatrixTable format (as far as I have explored, Hail is not able to load the Parquet files).

To frame a more specific question here is a sample scenario:
A batch of 1000 genomes arrives every month. What is the best way to use Hail in the cloud to run IBD on the 1000 new samples each month (and compare them against all existing samples)?

Should I store every sample in a separate MatrixTable and then pool them together every time I want to run queries across the genomes?

Thank you.

Hey @igorm!

Hail operations are fastest on either the native Matrix Table format or BGEN (the latter is currently better suited to genotyped data than Hail’s native format). If speed is important to you, I recommend importing to Hail’s native format.
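As a rough illustration of that route, here is a minimal sketch that imports a jointly-called, block-gzipped VCF into a native MatrixTable and then runs `hl.identity_by_descent` on it. The helper names, the path naming scheme, and the GRCh38 reference are assumptions for illustration, not something taken from this thread. Note that `hl.import_vcf` expects a project VCF rather than per-sample gVCFs.

```python
def native_path(vcf_path):
    """Derive a MatrixTable path next to the VCF (hypothetical naming scheme)."""
    for suffix in (".vcf.bgz", ".vcf.gz", ".vcf"):
        if vcf_path.endswith(suffix):
            return vcf_path[: -len(suffix)] + ".mt"
    raise ValueError(f"not a VCF path: {vcf_path}")


def import_and_ibd(vcf_path, min_pi_hat=0.2):
    """Write the VCF as a native MatrixTable, then compute pairwise IBD."""
    import hail as hl  # imported here so native_path works without Hail installed

    mt_path = native_path(vcf_path)
    # force_bgz treats .gz files as block-gzipped; GRCh38 is an assumption.
    mt = hl.import_vcf(vcf_path, reference_genome="GRCh38", force_bgz=True)
    mt.write(mt_path, overwrite=True)

    # Re-read so that later queries run against the native format.
    mt = hl.read_matrix_table(mt_path)
    # Returns a Table of sample pairs; pairs with PI_HAT below `min` are dropped.
    return hl.identity_by_descent(mt, min=min_pi_hat)
```

The import is a one-time cost per dataset; after that, every query can start from `hl.read_matrix_table`.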

We have been building a suite of tools specifically for this use case. We call this stuff the “Variant Dataset” and the “Variant Dataset Combiner”. They’re documented under hl.vds. The interface may change slightly over the next year as we address as yet unforeseen challenges in analyzing this kind of data.

The Variant Dataset (VDS) is a lossless representation of the information in one or more gVCFs. In stark contrast, a project VCF (aka jointly-called VCF) is a lossy representation: it elides reference blocks. The Variant Dataset Combiner is a cheap, scalable, and fast way to combine one or more gVCFs and/or VDSes into a new VDS. IIRC, the current cost per sample is ~0.03 USD.
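For a sense of what a combiner run might look like, here is a minimal sketch built on `hl.vds.new_combiner`. The helper names, the paths, the GRCh38 reference, and the choice of `use_genome_default_intervals=True` (which assumes whole-genome data) are all assumptions for illustration. Since the interface may still change, please check the `hl.vds` docs for your Hail version before relying on it.

```python
def combiner_args(output_path, temp_path, gvcf_paths=(), vds_paths=()):
    """Build keyword arguments for hl.vds.new_combiner (hypothetical helper)."""
    if not gvcf_paths and not vds_paths:
        raise ValueError("need at least one gVCF or VDS to combine")
    args = {
        "output_path": output_path,
        "temp_path": temp_path,
        "reference_genome": "GRCh38",  # assumption: GRCh38 calls
    }
    if gvcf_paths:
        args["gvcf_paths"] = list(gvcf_paths)
        # gVCF inputs need import intervals; assumes whole-genome data.
        args["use_genome_default_intervals"] = True
    if vds_paths:
        args["vds_paths"] = list(vds_paths)
    return args


def combine(output_path, temp_path, gvcf_paths=(), vds_paths=()):
    """Run the combiner and return the resulting VariantDataset."""
    import hail as hl  # imported here so combiner_args works without Hail installed

    combiner = hl.vds.new_combiner(
        **combiner_args(output_path, temp_path, gvcf_paths, vds_paths)
    )
    combiner.run()
    return hl.vds.read_vds(output_path)
```

The same call shape covers both cases: pass `gvcf_paths` to turn a batch of gVCFs into a VDS, or pass `vds_paths` to merge existing VDSes.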

OK, so, these are just tools, they’re not a complete solution to your problem. If you have a new batch of 1000 genomes every month, you may not want to pay the cost to combine those 1000 samples with an extant dataset of 10,000 samples. Instead, you can import the 1000 gVCF files into a 1000-sample VDS file. Now you have your main 10,000-sample dataset, D0, and your new 1,000-sample dataset, D1. Currently, Hail’s IBD does not have support for incrementally computing the IBD stats, but that’s something we want and we can provide support on this forum if you want to build it. If you built an incremental IBD, then you can compute the IBD stats on D1 with itself as well as D1 with D0’s samples.
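To make the potential saving concrete, here is a small back-of-the-envelope sketch in plain Python. It only counts sample pairs and does not compute any IBD statistics. The function name is hypothetical, and it assumes the pairs within D0 were already computed in an earlier run.

```python
def ibd_pair_counts(n_existing, n_new):
    """Count sample pairs for an incremental vs. a full IBD run."""
    within_new = n_new * (n_new - 1) // 2  # D1 with itself
    cross = n_new * n_existing             # D1 with D0's samples
    total = n_existing + n_new
    full = total * (total - 1) // 2        # every pair in D0 + D1
    return {"incremental": within_new + cross, "full": full}


counts = ibd_pair_counts(n_existing=10_000, n_new=1_000)
print(counts)  # {'incremental': 10499500, 'full': 60494500}
```

For the scenario above, an incremental run would touch roughly 17% of the pairs that a full recomputation over all 11,000 samples would.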

Suppose 10 months have gone by and you now have a 10,000-sample VDS and ten 1,000-sample VDSes. If you ever need to reanalyze this entire dataset, for example to fit a new statistical model, you will benefit from combining these 11 VDSes into one VDS. You pay the cost of combining once, and all subsequent whole-dataset analyses benefit from the colocation of the data without paying it again.

Thank you @danking. I will look into the hl.vds.