Using Hail features at scale in the cloud

Hi,

At the moment I am storing every sample as a gVCF and also in Parquet format on AWS.
I am able to query the data on AWS using Athena, which is great. But Athena was not built specifically for genetics-related queries, and it lacks features that Hail already has built in (for example, identity by descent).

Now I am wondering if it would be better to switch from Parquet directly to Hail's native MatrixTable format (as far as I have explored, Hail cannot load Parquet).

To frame a more specific question, here is a sample scenario:
A batch of 1000 genomes arrives every month. What is the best way to use Hail in the cloud to conduct IBD on the 1000 new samples every month (and compare them against all existing samples)?

Should I store every sample in a separate MatrixTable and then pool them together every time I want to conduct queries across the genomes?

Thank you.

Hey @igorm!

Hail operations are fastest on either the native MatrixTable format or BGEN (the latter is currently better suited to genotyped data than Hail’s native format). If speed is important to you, I recommend importing to Hail’s native format.
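For example, importing a bgzipped VCF into the native format is a one-time write, after which all reads are fast. A minimal sketch, with placeholder S3 paths and reference genome:

```python
import hail as hl

hl.init()

# Placeholder input path; reference genome depends on your data.
mt = hl.import_vcf(
    's3://my-bucket/cohort.vcf.bgz',
    reference_genome='GRCh38',
)

# One-time conversion to Hail's native MatrixTable format.
mt.write('s3://my-bucket/cohort.mt', overwrite=True)

# Subsequent analyses read the native format directly, which is much faster
# than re-parsing VCF text every time.
mt = hl.read_matrix_table('s3://my-bucket/cohort.mt')
```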

We have been building a suite of tools specifically for this use case. We call this stuff the “Variant Dataset” and the “Variant Dataset Combiner”. They’re documented under hl.vds. The interface may change slightly over the next year as we address as yet unforeseen challenges in analyzing this kind of data.

The Variant Dataset (VDS) is a lossless representation of the information in one or more gVCFs. In stark contrast, a project VCF (aka jointly-called VCF) is a lossy representation: it elides reference blocks. The Variant Dataset Combiner is a cheap, scalable, and fast way to combine one or more gVCFs and/or VDSes into a new VDS. IIRC, the current cost per sample is ~0.03 USD.
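As a rough sketch of what that looks like (paths, reference genome, and interval settings here are placeholders for your own setup), combining a batch of gVCFs into a VDS goes through `hl.vds.new_combiner`:

```python
import hail as hl

hl.init()

# Placeholder gVCF paths for one monthly batch.
gvcf_paths = [
    's3://my-bucket/gvcfs/sample-0001.g.vcf.gz',
    's3://my-bucket/gvcfs/sample-0002.g.vcf.gz',
    # ...
]

combiner = hl.vds.new_combiner(
    output_path='s3://my-bucket/batch-01.vds',  # the new VDS
    temp_path='s3://my-bucket/tmp/',            # scratch space for intermediates
    gvcf_paths=gvcf_paths,
    reference_genome='GRCh38',
    use_genome_default_intervals=True,          # partitioning suited to whole genomes
)
combiner.run()

vds = hl.vds.read_vds('s3://my-bucket/batch-01.vds')
```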

OK, so, these are just tools, they’re not a complete solution to your problem. If you have a new batch of 1000 genomes every month, you may not want to pay the cost to combine those 1000 samples with an extant dataset of 10,000 samples. Instead, you can import the 1000 gVCF files into a 1000-sample VDS file. Now you have your main 10,000-sample dataset, D0, and your new 1,000-sample dataset, D1. Currently, Hail’s IBD does not support incrementally computing the IBD stats, but that’s something we want, and we can provide support on this forum if you want to build it. With an incremental IBD, you could compute IBD stats on D1 against itself as well as D1 against D0’s samples.
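In the meantime, a non-incremental workaround is to densify, run `hl.identity_by_descent` over the combined columns, and keep only pairs that involve a new sample. A minimal sketch, assuming the densified entries carry local calls (LGT/LA) and using placeholder paths and thresholds:

```python
import hail as hl

hl.init()

# Placeholder paths for the main dataset (D0) and the new monthly batch (D1).
d0 = hl.vds.read_vds('s3://my-bucket/main-10k.vds')
d1 = hl.vds.read_vds('s3://my-bucket/batch-01.vds')

def vds_to_gt_mt(vds):
    """Densify a VDS into a biallelic MatrixTable with a GT entry field."""
    mt = hl.vds.to_dense_mt(vds)
    # Assumes entries carry local calls; translate LGT/LA into a global GT.
    mt = mt.annotate_entries(GT=hl.vds.lgt_to_gt(mt.LGT, mt.LA))
    mt = mt.select_entries('GT')
    return hl.split_multi_hts(mt)

mt0 = vds_to_gt_mt(d0)
mt1 = vds_to_gt_mt(d1)

# Non-incremental: join the columns, run IBD once over everything, then keep
# only the pairs that involve at least one sample from the new batch.
mt = mt0.union_cols(mt1, row_join_type='outer')

new_samples = hl.literal(set(mt1.s.collect()))
ibd = hl.identity_by_descent(mt, min=0.125)  # keep pairs with PI_HAT >= 0.125
ibd = ibd.filter(new_samples.contains(ibd.i) | new_samples.contains(ibd.j))
ibd.export('s3://my-bucket/ibd-batch-01.tsv')
```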

Suppose 10 months have gone by and you now have a 10,000-sample VDS and ten 1,000-sample VDSes. If you ever need to reanalyze the entire dataset, for example to fit a new statistical model, you will benefit from combining these 11 VDSes into one VDS. All subsequent whole-dataset analyses will then benefit from the colocation of the data without having to pay that combining cost again.
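Combining existing VDSes uses the same combiner interface, just with `vds_paths` instead of `gvcf_paths`. A minimal sketch with placeholder paths:

```python
import hail as hl

hl.init()

# Placeholder paths: the main VDS plus ten monthly batch VDSes.
vds_paths = ['s3://my-bucket/main-10k.vds'] + [
    f's3://my-bucket/batch-{i:02d}.vds' for i in range(1, 11)
]

combiner = hl.vds.new_combiner(
    output_path='s3://my-bucket/main-20k.vds',
    temp_path='s3://my-bucket/tmp/',
    vds_paths=vds_paths,
)
combiner.run()
```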

Thank you @danking. I will look into hl.vds.