Using Hail for Cohort Querying

IqviaGarrettBromley · March 10, 2020, 4:20pm

Currently, we have a large infrastructure supporting genomic cohort building (finding which patients have variants that match criteria specified by the user). These systems are built on Oracle, Hive, and Elasticsearch, which do the job fairly well. I wanted to start a discussion on whether it is feasible to replicate this level of cohorting within Hail, using its distributed approach with MatrixTables, and whether this is worth pursuing.

Currently, we allow the user to build an advanced query using a number of criteria, such as the following:

Specific Genes/List of Genes
Variant Class Types
Types of Clinical Significance
And several other types of annotated data

The user can build a complex query, combining these criteria in any order with ANDs, ORs, and parenthesized groups to fit their cohort requirements.

I envision the flow would be to read the query provided and translate it into an expression that Hail can understand, provided all the data needed to support such queries is present.

Is this something Hail is able to do, and do efficiently?

konradjk · March 10, 2020, 4:25pm

This is great! I definitely want to have this conversation and build it out into different queries.

For your first use case, variant-level (or groups of variants) queries should be very straightforward. Any row-based queries should be quite efficient in Hail, since filter_rows on a key field narrows the read to the partitions that contain the matching data.
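A minimal sketch of such a key-based row query, assuming a MatrixTable keyed by locus and regions given as (chromosome, start, end) tuples. The function names and the GRCh37 default are illustrative assumptions, not something from this thread; see the Hail documentation for how interval bounds are interpreted.

```python
def region_to_interval_string(chromosome, start, end):
    """Format a region as the 'contig:start-end' string Hail can parse."""
    if start > end:
        raise ValueError("start must not exceed end")
    return f"{chromosome}:{start}-{end}"


def filter_to_regions(mt, regions, reference_genome="GRCh37"):
    """Keep only the rows of `mt` whose locus falls in one of `regions`.

    `regions` is an iterable of (chromosome, start, end) tuples.
    """
    # Imported lazily so the string helper above works without Hail installed.
    import hail as hl

    intervals = [
        hl.parse_locus_interval(
            region_to_interval_string(*region),
            reference_genome=reference_genome,
        )
        for region in regions
    ]
    # filter_intervals works on the row key, so Hail should only need to
    # read the partitions that overlap the requested intervals.
    return hl.filter_intervals(mt, intervals)
```

With a MatrixTable `mt` already loaded, `filter_to_regions(mt, [("1", 1000000, 2000000)])` would return the subset of rows in that region.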

Things that are not the keys (e.g. Types of Clinical Significance) or that need to scan lots of data (I presume by Variant Class Types you mean “all LoFs in this patient” or similar?) are trickier but something I’d love to see happen (we have some column-level queries we’d love to optimize, akin to saying “all the variants in a given patient”).

IqviaGarrettBromley · March 10, 2020, 4:38pm

Yes, that is what I mean by Variant Classes; other types are supported as well (based on the data).

I assume it wouldn’t be too difficult to do, and with some testing on our infrastructure, it might be worth investigating the performance differences between what we currently have in other systems and what we get within hail.

Since our analytics already use Hail, it would be nice to have one cohesive system.

cseed · March 10, 2020, 4:55pm

Hi Garrett,

Thanks for the question! I would say this is an awesome use case and something we’d love to be able to support.

Do you also want to constrain sample criteria? E.g. males with disease X? What’s returned by your query? I guess something like the list of matching genotypes, their samples and variants, maybe along with some corresponding sample and/or variant criteria?

While you could probably do this now, the question becomes whether you can do it quickly and efficiently. My guess is probably not without improvements to Hail itself. A few concrete thoughts:

Right now, (Matrix)Tables are only indexed on the key. Hail doesn’t have explicit support for secondary indices. Therefore, you’d need to simulate them by storing a separate lookup for each criterion you want to search on, joining those lookups, and then joining the result with the main table.
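To make the idea of a simulated secondary index concrete, here is a conceptual sketch in plain Python rather than Hail. The variant keys, gene names, and field names are made up for illustration. Each index maps an attribute value to the set of variant keys that carry it; the per-criterion results are combined first, and only then looked up in the main table.

```python
from collections import defaultdict

# Hypothetical "main table": variant key -> annotations.
variants = {
    ("1", 1000, "A", "T"): {"gene": "GENE_A", "clin_sig": "benign"},
    ("1", 2000, "G", "C"): {"gene": "GENE_A", "clin_sig": "pathogenic"},
    ("2", 3000, "C", "T"): {"gene": "GENE_B", "clin_sig": "benign"},
}


def build_index(table, field):
    """Build a secondary index: field value -> set of variant keys."""
    index = defaultdict(set)
    for key, row in table.items():
        index[row[field]].add(key)
    return dict(index)


by_gene = build_index(variants, "gene")
by_clin_sig = build_index(variants, "clin_sig")

# Combine the per-criterion lookups (an AND is a set intersection) ...
matching_keys = by_gene.get("GENE_A", set()) & by_clin_sig.get("benign", set())

# ... then join the surviving keys back to the main table.
matching_rows = {key: variants[key] for key in matching_keys}
```

In Hail, the analogous approach would presumably be one table per criterion, keyed by that criterion, with the combined keys joined back to the main (Matrix)Table.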

(Matrix)Tables aren’t indexed on samples and it isn’t efficient to pull out a single sample from a large dataset. This is getting requested increasingly often, and we will need a solution for it, although we don’t have a timeline yet.

I’d love if we could make this concrete by getting some representative data and queries we could start benchmarking.

IqviaGarrettBromley · March 10, 2020, 7:10pm

I will put together some realistic use cases for queries on my end and maybe some benchmarks to see how it comes out! This might take a couple weeks to benchmark considering my current workload but I will see about posting queries asap.

IqviaGarrettBromley · October 15, 2020, 6:18pm

I wanted to give an update with the progress on this topic. I’ve built a full genomic querying engine on top of hail supporting the following filtering methods:

Genomic Regions
You are able to pass through a list of regions you are interested in, for example:

Chromosome
Start Position
End Position

Genomic Variants
Ability to pass through a list of variants you are interested in, for example:

Chromosome
Position
Reference Allele Value
Alternate Allele Value

Those allow you to filter the variants by locus/allele values and positional ranges. Obviously, by itself, this isn’t that helpful, so those criteria carry additional properties that let you specify attributes the variants meeting the positional criteria above must have:

Variant Class from NCBI dbSNP (synonymous, frameshift)
Variant Type from NCBI dbSNP (single nucleotide variant)
Clinical Significance from ClinVar (benign, likely benign)
Allele Frequency from dataset or gnomAD (various operations available)
Zygosity (homozygous, heterozygous, or either)

After the variants are filtered by the positional and variant-attribute criteria above, the remaining entries are aggregated per sample to check whether the sample has any of the resulting variants with the zygosity specified for that variant/sample combination.
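A rough sketch of that per-sample step, assuming the MatrixTable has already been filtered to the variants of interest and has an entry field named GT. This is not the actual engine code; the mapping and function name below are hypothetical.

```python
# Map the zygosity options listed above to methods on a Hail call expression.
ZYGOSITY_TESTS = {
    "heterozygous": lambda gt: gt.is_het(),
    "homozygous": lambda gt: gt.is_hom_var(),
    "either": lambda gt: gt.is_non_ref(),
}


def samples_with_variants(mt, zygosity="either"):
    """Keep the columns (samples) with at least one matching genotype.

    Assumes `mt` holds only the rows that passed the variant filters and
    that genotype calls live in an entry field called `GT`.
    """
    # Imported lazily so the mapping above can be used without Hail installed.
    import hail as hl

    has_match = ZYGOSITY_TESTS[zygosity](mt.GT)
    # In filter_cols, hl.agg.any aggregates over the rows for each column.
    return mt.filter_cols(hl.agg.any(has_match))
```

The sample IDs could then be collected from the column key of the returned MatrixTable.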

Working with the 1000 Genomes data, scaled up (by copying the genotype data) to 100k and 1 million samples, queries run very fast, under 10 seconds in most cases.

As of right now, the biggest hurdle we face is when one of our QA testers throws a very broad query at the engine, which results in aggregating over billions of entries in the matrix table even though we filtered out many of the rows.

Initially, we expected hail.agg.any(...) to have a short-circuiting mechanism built in, so that when we filter out the columns that have none of the selected variants, the aggregation would stop at the first match and not need to iterate over the remaining variants for that sample.

As it turns out, however, the aggregation methods have no short circuit functionality in place, and so I believe our biggest bottleneck lies in this aggregation.
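To show what short-circuiting would save, here is a plain-Python illustration (not Hail code, and not a description of Hail's internals). It counts how many entries are examined for one sample with and without an early exit; the data is made up.

```python
def count_evaluations(genotypes, short_circuit):
    """Return (found, number_of_entries_examined) for one sample."""
    examined = 0
    found = False
    for has_variant in genotypes:
        examined += 1
        if has_variant:
            found = True
            if short_circuit:
                break  # stop at the first match
    return found, examined


# One sample with a match at the second of 1000 filtered variants.
sample_entries = [False, True] + [False] * 998

print(count_evaluations(sample_entries, short_circuit=True))   # (True, 2)
print(count_evaluations(sample_entries, short_circuit=False))  # (True, 1000)
```

Both calls give the same answer, but without the early exit every entry is visited, which is the cost that adds up over billions of entries.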

For the most part, queries are small and focused on their targeted variants/regions and are inherently very fast, but this is something we want to get better with.

On top of all of this, you are able to build an arbitrarily nested tree structure with the criteria you specify, so you can express something like the following:

(Variant 1 OR Variant 2) AND Synonymous AND Benign

OR

NOT (Region 1 AND Multi-Nucleotide Variant)

That is a simple representation, but shows the kind of boolean logic that can be put into place.
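One way such a tree could be represented and evaluated, sketched in plain Python against a single variant record. The tuple encoding, field names, and values are assumptions for illustration only, and the example reads the two expressions above as joined by the OR between them.

```python
def evaluate(node, variant):
    """Recursively evaluate a nested boolean query against one variant."""
    op = node[0]
    if op == "AND":
        return all(evaluate(child, variant) for child in node[1:])
    if op == "OR":
        return any(evaluate(child, variant) for child in node[1:])
    if op == "NOT":
        return not evaluate(node[1], variant)
    if op == "LEAF":
        _, field, value = node
        return variant.get(field) == value
    raise ValueError(f"unknown operator: {op!r}")


# ((Variant 1 OR Variant 2) AND Synonymous AND Benign)
#   OR NOT (Region 1 AND Multi-Nucleotide Variant)
query = (
    "OR",
    (
        "AND",
        ("OR", ("LEAF", "id", "variant1"), ("LEAF", "id", "variant2")),
        ("LEAF", "variant_class", "synonymous"),
        ("LEAF", "clin_sig", "benign"),
    ),
    (
        "NOT",
        (
            "AND",
            ("LEAF", "region", "region1"),
            ("LEAF", "variant_type", "multi-nucleotide variant"),
        ),
    ),
)

example = {
    "id": "variant1",
    "variant_class": "synonymous",
    "clin_sig": "benign",
    "region": "region1",
    "variant_type": "multi-nucleotide variant",
}
matches = evaluate(query, example)
```

When targeting Hail instead of Python records, the same recursion could return boolean expressions combined with Hail's `&`, `|`, and `~` operators rather than Python booleans.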

Our user interface allows you to build these queries with ease and then submit them to the Hail cluster for processing. Once you get the results back, you can do a few things with them.

You can export the genomic sample IDs
Using a clinical cohort you have built with our other tools, you can restrict the cohort to people who have genomic data, match up the clinical and genomic patient IDs of interest, and annotate them with feature flags for case/control analysis
This case/control file is then uploaded to the cluster
A GWAS can be executed with various parameters, returning output that can then be visualized and inspected in the browser

I have included a couple example screenshots below:
