Using Hail for Cohort Querying

Currently, we have a vast infrastructure for supporting genomic cohort building (which patients have which variants matching criteria specified by the user). These systems are built on Oracle, Hive, and Elasticsearch, which do the job fairly well. I wanted to start a discussion about whether it is feasible to replicate this level of cohorting within Hail, using the distributed approach of its MatrixTables, and whether that is worth pursuing.

One thing we currently do is let the user build an advanced query from a number of criteria, such as the following:

  • Specific Genes/List of Genes
  • Variant Class Types
  • Types of Clinical Significance
  • And several other types of annotated data

The user can build a complex query, combining these criteria with ANDs and ORs in any order and grouping them with parentheses to fit their cohort requirements.

I envision the flow would be to read the query provided and translate it into an expression that Hail can understand, assuming all the data needed to support such queries is present.

Is this something Hail is able to do/do efficiently?

This is great! I definitely want to have this conversation and build it out into different queries.

For your first use case, queries on variants (or groups of variants) should be very straightforward. Any row-based query on the key should be quite efficient in Hail, since filter_rows on a key field narrows the scan to the partitions that contain the data.

Things that are not the keys (e.g. Types of Clinical Significance) or that need to scan lots of data (I presume by Variant Class Types you mean “all LoFs in this patient” or similar?) are trickier but something I’d love to see happen (we have some column-level queries we’d love to optimize, akin to saying “all the variants in a given patient”).

Yes, that is what I mean by Variant Classes, other types are supported as well (based on the data).

I assume it wouldn’t be too difficult to do, and with some testing on our infrastructure, it might be worth investigating the performance differences between what we currently have in other systems and what we get within hail.

Since our analytics already run on Hail, it would be nice to have one cohesive system.

Hi Garrett,

Thanks for the question! I would say this is an awesome use case and something we’d love to be able to support.

Do you also want to constrain sample criteria? E.g. males with disease X? What’s returned by your query? I guess something like the list of matching genotypes, their samples and variants, maybe along with some corresponding sample and/or variant criteria?

While you could probably do this now, the question becomes can you do it fast/efficiently. My guess is probably not without improvements to Hail itself. A few concrete thoughts:

Right now, (Matrix)Tables are only indexed on the key. Hail doesn’t have explicit support for secondary indices. Therefore, you’d need to simulate them by storing a separate table for each criterion you want to search on, joining those tables to each other, and then joining the result with the main table.
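
To make the idea of a simulated secondary index concrete, here is a minimal sketch in plain Python rather than Hail. The toy rows, field names, and helper functions are all hypothetical. Each criterion gets its own lookup from value to primary keys, the key sets are intersected, and the result is joined back to the main rows. In Hail the analogous pieces would presumably be a table re-keyed on the criterion (Table.key_by) and a keyed join back to the MatrixTable, but that mapping is an assumption, not something benchmarked here.

```python
from collections import defaultdict

# Hypothetical toy rows. The primary key is (locus, alleles), as in a MatrixTable.
ROWS = {
    ("1:1000", ("A", "T")): {"clin_sig": "benign", "variant_class": "synonymous"},
    ("1:2000", ("G", "C")): {"clin_sig": "pathogenic", "variant_class": "frameshift"},
    ("2:5000", ("C", "T")): {"clin_sig": "benign", "variant_class": "frameshift"},
}


def build_index(rows, field):
    """Map each value of `field` to the set of primary keys that carry it."""
    index = defaultdict(set)
    for key, annotations in rows.items():
        index[annotations[field]].add(key)
    return index


def lookup(rows, indexes, criteria):
    """Intersect the per-field key sets, then join back to the main rows."""
    key_sets = [indexes[field].get(value, set()) for field, value in criteria.items()]
    matching = set.intersection(*key_sets) if key_sets else set(rows)
    return {key: rows[key] for key in matching}


# One "secondary index" per searchable criterion.
INDEXES = {field: build_index(ROWS, field) for field in ("clin_sig", "variant_class")}

hits = lookup(ROWS, INDEXES, {"clin_sig": "benign", "variant_class": "frameshift"})
```

The cost of this approach is that every searchable criterion needs its own stored lookup, which has to be rebuilt whenever the annotations change.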

(Matrix)Tables aren’t indexed on samples and it isn’t efficient to pull out a single sample from a large dataset. This is getting requested increasingly often, and we will need a solution for it, although we don’t have a timeline yet.

I’d love if we could make this concrete by getting some representative data and queries we could start benchmarking.


I will put together some realistic use cases for queries on my end and maybe some benchmarks to see how it comes out! This might take a couple weeks to benchmark considering my current workload but I will see about posting queries asap.

I wanted to give an update with the progress on this topic. I’ve built a full genomic querying engine on top of hail supporting the following filtering methods:

Genomic Regions
You are able to pass through a list of regions you are interested in, for example:

  • Chromosome
  • Start Position
  • End Position
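
As a rough sketch of how a region list like this could be turned into a Hail row filter (not the engine's actual code): the helper names, the tuple layout, and the GRCh37 default are assumptions made for illustration. hl.parse_locus_interval and hl.filter_intervals are the Hail calls that do the work, and Hail is imported lazily so the string helper can be used without a cluster.

```python
def to_interval_string(chromosome, start, end):
    """Format a region as 'contig:start-end', a form hl.parse_locus_interval accepts."""
    if start > end:
        raise ValueError(f"start {start} is after end {end}")
    return f"{chromosome}:{start}-{end}"


def filter_to_regions(mt, regions, reference_genome="GRCh37"):
    """Keep only the rows of `mt` whose locus falls inside one of `regions`.

    `regions` is a list of (chromosome, start, end) tuples (hypothetical layout).
    """
    import hail as hl  # lazy import: only needed when a MatrixTable is filtered

    intervals = [
        hl.parse_locus_interval(
            to_interval_string(*region), reference_genome=reference_genome
        )
        for region in regions
    ]
    # Interval filters on the row key let Hail skip partitions outside the regions.
    return hl.filter_intervals(mt, intervals)


# The string helper on its own, without touching Hail:
region_strings = [
    to_interval_string(*region)
    for region in [("1", 100000, 200000), ("X", 5000, 9000)]
]
```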

Genomic Variants
Ability to pass through a list of variants you are interested in, for example:

  • Chromosome
  • Position
  • Reference Allele Value
  • Alternate Allele Value

Those allow you to filter the variants by locus/allele values and ranges (position-based). By itself this isn’t that helpful, so those criteria carry additional properties that let you specify some of the following attributes, which the variants meeting the positional criteria above must also satisfy:

  • Variant Class from NCBI dbSNP (synonymous, frameshift)
  • Variant Type from NCBI dbSNP (single nucleotide variant)
  • Clinical Significance from ClinVar (benign, likely benign)
  • Allele Frequency from dataset or gnomAD (various operations available)
  • Zygosity (homozygous, heterozygous, or either)

After the variants are filtered on the positional and variant-attribute criteria above, we aggregate over each sample to check whether it carries any of the remaining variants with the zygosity specified for that variant/sample combination.
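
A minimal sketch of that per-sample step, assuming a single zygosity for the whole query and an entry field named GT (the usual name after a VCF import, but an assumption here). The function names and the 'het' / 'hom' / 'either' labels are hypothetical. The plain-Python model mirrors what the Hail predicates check, so the logic can be tried without a cluster.

```python
def zygosity_matches(call, zygosity):
    """Check one diploid call, given as a pair of allele indices (0 = reference)."""
    first, second = call
    is_het = first != second
    is_hom_var = first == second and first != 0
    if zygosity == "het":
        return is_het
    if zygosity == "hom":
        return is_hom_var
    if zygosity == "either":
        return is_het or is_hom_var
    raise ValueError(f"unknown zygosity: {zygosity!r}")


def matching_samples_model(calls_by_sample, zygosity):
    """Plain-Python model: samples with at least one call of the requested zygosity."""
    return {
        sample
        for sample, calls in calls_by_sample.items()
        if any(zygosity_matches(call, zygosity) for call in calls)
    }


def matching_samples(mt, zygosity):
    """Hail version: keep columns (samples) with any matching entry in the filtered rows."""
    import hail as hl  # lazy import so the model above runs without Hail

    predicate = {
        "het": mt.GT.is_het(),
        "hom": mt.GT.is_hom_var(),
        "either": mt.GT.is_non_ref(),
    }[zygosity]
    # Inside filter_cols, hl.agg.any aggregates over the rows of each column.
    return mt.filter_cols(hl.agg.any(predicate))


# Hypothetical calls for three samples over the variants left after row filtering.
calls_by_sample = {
    "S1": [(0, 0), (0, 1)],
    "S2": [(1, 1), (0, 0)],
    "S3": [(0, 0), (0, 0)],
}
het_samples = matching_samples_model(calls_by_sample, "het")
hom_samples = matching_samples_model(calls_by_sample, "hom")
```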

Working with the 1000 Genomes data, multiplied up to 100k and 1 million samples by copying the genotype data, queries run very fast, under 10 seconds in most cases.

As of right now, the biggest hurdle we face is when one of our QA testers throws a very broad query at the engine, which results in us aggregating over billions of entries in the matrix table even though we filtered out a lot of the rows.

Initially we expected hail.agg.any(...) to have a short-circuiting mechanism built in: when we filter out the columns (samples) that have none of the remaining variants, the aggregation would stop at the first matching entry and not iterate over the rest of the variants for that sample.

As it turns out, however, the aggregation methods have no short circuit functionality in place, and so I believe our biggest bottleneck lies in this aggregation.
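
The difference between the two behaviours can be illustrated in plain Python. This is an analogy only and says nothing about how Hail implements its aggregators. Both hypothetical functions answer "does this sample carry any non-reference call?", but one stops at the first hit while the other visits every entry.

```python
def has_variant_short_circuit(calls):
    """Stop at the first non-reference call. Returns (result, calls examined)."""
    examined = 0
    for call in calls:
        examined += 1
        if any(allele != 0 for allele in call):
            return True, examined
    return False, examined


def has_variant_full_scan(calls):
    """Visit every call even after a hit. Returns (result, calls examined)."""
    examined = 0
    found = False
    for call in calls:
        examined += 1
        found = found or any(allele != 0 for allele in call)
    return found, examined


# One sample: a heterozygous call in second position, then 1000 reference calls.
calls = [(0, 0), (0, 1)] + [(0, 0)] * 1000

short = has_variant_short_circuit(calls)  # stops after 2 calls
full = has_variant_full_scan(calls)       # examines all 1002 calls
```

For a broad query the full scan is repeated for every sample, which is where the billions of entries come from.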

For the most part, queries are small and focused on their targeted variants/regions and are inherently very fast, but this is something we want to get better with.

On top of all of this, you are able to build an infinitely nested tree structure with the criteria you specify, so you can mimic something like the following:

(Variant 1 OR Variant 2) AND Synonymous AND Benign


NOT (Region 1 AND Multi-Nucleotide Variant)

That is a simple representation, but shows the kind of boolean logic that can be put into place.
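
A nested criteria tree like this can be sketched as a small recursive structure. The node classes, field names, and example variants below are hypothetical and evaluate against plain dictionaries. To target Hail instead, the leaves would return boolean expressions and the combinators would use Hail's &, | and ~ operators rather than Python's and / or / not.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Leaf:
    name: str
    test: Callable[[dict], bool]


@dataclass(frozen=True)
class And:
    children: Tuple


@dataclass(frozen=True)
class Or:
    children: Tuple


@dataclass(frozen=True)
class Not:
    child: object


def evaluate(node, variant):
    """Recursively evaluate a criteria tree against one variant record."""
    if isinstance(node, Leaf):
        return node.test(variant)
    if isinstance(node, And):
        return all(evaluate(child, variant) for child in node.children)
    if isinstance(node, Or):
        return any(evaluate(child, variant) for child in node.children)
    if isinstance(node, Not):
        return not evaluate(node.child, variant)
    raise TypeError(f"unknown node type: {type(node).__name__}")


# (Variant 1 OR Variant 2) AND Synonymous AND Benign
variant_1 = Leaf("Variant 1", lambda v: v["id"] == "1:1000:A:T")
variant_2 = Leaf("Variant 2", lambda v: v["id"] == "1:2000:G:C")
synonymous = Leaf("Synonymous", lambda v: v["variant_class"] == "synonymous")
benign = Leaf("Benign", lambda v: v["clin_sig"] == "benign")
query = And((Or((variant_1, variant_2)), synonymous, benign))

variants = [
    {"id": "1:1000:A:T", "variant_class": "synonymous", "clin_sig": "benign"},
    {"id": "1:2000:G:C", "variant_class": "frameshift", "clin_sig": "benign"},
    {"id": "2:5000:C:T", "variant_class": "synonymous", "clin_sig": "benign"},
]
matches = [v["id"] for v in variants if evaluate(query, v)]
```

Because every node is just another tree, the nesting depth is limited only by what the user interface lets you build.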

Our user interface allows you to build these queries with ease and then submit them to the Hail cluster for processing. Once you get the results back, you can do a few things with them.

  1. You can export the genomic sample IDs
  2. Using a clinical cohort you have built with our other tools, you can narrow the cohort down to people who have genomic data present, match up the clinical and genomic patient IDs of interest, and annotate them with feature flags for case/control analysis
  3. This case/control file is then uploaded to the cluster
  4. A GWAS can be executed with various parameters which returns an output that can then be visualized and inspected in the browser

I have included a couple example screenshots below: