Using Hail for Cohort Querying

Currently, we have a vast infrastructure for supporting genomic cohort building (finding which patients have variants that match criteria specified by the user). These systems are built on Oracle, Hive, and Elasticsearch, which do the job fairly well. I wanted to start a discussion on whether it is feasible to replicate this level of cohorting within Hail, using the distributed approach behind its MatrixTables, and whether that is worth pursuing.

Currently, we let the user build an advanced query using a number of criteria, such as the following:

  • Specific Genes/List of Genes
  • Variant Class Types
  • Types of Clinical Significance
  • And several other types of annotated data

The user can build a complex query by combining these criteria with any arrangement of ANDs, ORs, and parenthesized groups to fit their cohort requirements.

I envision the flow would be to read the user's query and translate it into a command that Hail can understand, provided all the data needed to support such queries is present.
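To make the translation step concrete, here is a minimal sketch of how a nested AND/OR query could be folded into a single predicate. The query layout (`op`, `args`, `field`, `values`) and the field names are hypothetical, chosen only for illustration. The combiner relies on nothing but the `&` and `|` operators, so it is demonstrated here on plain Python booleans.

```python
import operator
from functools import reduce


def build_predicate(node, leaf):
    """Recursively fold a nested AND/OR query tree into one predicate.

    `node` is either a leaf criterion (a dict without an "op" key) or
    {"op": "and" | "or", "args": [child, ...]} with a non-empty "args" list.
    `leaf` maps a leaf criterion to any object that supports `&` and `|`.
    """
    if "op" not in node:
        return leaf(node)
    combine = {"and": operator.and_, "or": operator.or_}[node["op"]]
    children = [build_predicate(child, leaf) for child in node["args"]]
    return reduce(combine, children)


# Hypothetical query:
#   gene in {BRCA1, BRCA2} AND (variant_class == LoF OR clin_sig == pathogenic)
query = {
    "op": "and",
    "args": [
        {"field": "gene", "values": ["BRCA1", "BRCA2"]},
        {
            "op": "or",
            "args": [
                {"field": "variant_class", "values": ["LoF"]},
                {"field": "clin_sig", "values": ["pathogenic"]},
            ],
        },
    ],
}


def matches(record):
    """Evaluate the query against one plain-dict record (stand-in for a row)."""
    return build_predicate(query, lambda c: record[c["field"]] in c["values"])
```

Hail's boolean expressions are also combined with `&` and `|`, so the same function should carry over if the leaf builder returns a Hail expression, for example `hl.literal(set(c["values"])).contains(mt[c["field"]])`, with the result passed to `mt.filter_rows(...)`. That variant is untested here and assumes the row fields exist under those names.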

Is this something Hail is able to do, and do efficiently?

This is great! I definitely want to have this conversation and build it out into different queries.

For your first use case, queries on variants (or groups of variants) should be very straightforward. Row-based queries should be quite efficient in Hail, since filter_rows on a key field narrows the read to the partitions that contain the matching data.
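As a rough illustration of why key-based filters are cheap, here is a toy model of partition pruning. It is not Hail's internal implementation: the partition boundaries are made up, and a real dataset is keyed by locus and alleles rather than a bare integer position. The point is only that, when data is sorted by key, a binary search over partition boundaries identifies the few partitions that need to be read.

```python
from bisect import bisect_left, bisect_right

# Toy model: each partition covers a sorted, non-overlapping, inclusive
# range of positions (first_pos, last_pos). Values are invented.
partition_bounds = [(1, 1000), (1001, 2000), (2001, 3000), (3001, 4000)]


def partitions_for_interval(bounds, start, end):
    """Return indices of partitions whose key range overlaps [start, end]."""
    firsts = [lo for lo, _ in bounds]
    lasts = [hi for _, hi in bounds]
    # First partition that ends at or after `start`.
    first_idx = bisect_left(lasts, start)
    # One past the last partition that starts at or before `end`.
    last_idx = bisect_right(firsts, end)
    return list(range(first_idx, last_idx))
```

In Hail itself, the documented way to restrict a dataset to a genomic region is `hl.filter_intervals` together with `hl.parse_locus_interval`; the sketch above only models the idea behind it.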

Things that are not keys (e.g. Types of Clinical Significance) or that need to scan lots of data (I presume by Variant Class Types you mean “all LoFs in this patient” or similar?) are trickier, but they are something I’d love to see happen. We have some column-level queries we’d love to optimize, akin to asking for “all the variants in a given patient”.

Yes, that is what I mean by Variant Classes; other types are supported as well (based on the data).

I assume it wouldn’t be too difficult to do. With some testing on our infrastructure, it might be worth investigating the performance differences between our current systems and Hail.

As our analytics already use Hail, it might be nice to have one cohesive system.

Hi Garrett,

Thanks for the question! I would say this is an awesome use case and something we’d love to be able to support.

Do you also want to constrain sample criteria? E.g. males with disease X? What’s returned by your query? I guess something like the list of matching genotypes, their samples and variants, maybe along with some corresponding sample and/or variant criteria?

While you could probably do this now, the question is whether you can do it quickly and efficiently. My guess is probably not without improvements to Hail itself. A few concrete thoughts:

Right now, (Matrix)Tables are only indexed on the key. Hail doesn’t have explicit support for secondary indices. Therefore, you’d need to simulate them by storing each criterion you want to search on separately, joining those together, and then joining the result with the main table.
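The simulated secondary index can be sketched in plain Python as an inverted index per criterion. All variant identifiers, gene names, and annotation values below are invented for illustration, and dictionaries stand in for tables.

```python
from collections import defaultdict

# Main "table", keyed by a variant identifier. All values are made up.
variants = {
    "1:100:A:T": {"gene": "GENE1", "clin_sig": "pathogenic"},
    "1:200:G:C": {"gene": "GENE1", "clin_sig": "benign"},
    "2:300:T:G": {"gene": "GENE2", "clin_sig": "pathogenic"},
}


def build_index(table, field):
    """Build a secondary index: annotation value -> set of primary keys."""
    index = defaultdict(set)
    for key, row in table.items():
        index[row[field]].add(key)
    return dict(index)


# One index per searchable criterion, stored alongside the main table.
by_gene = build_index(variants, "gene")
by_clin_sig = build_index(variants, "clin_sig")

# Query: gene == GENE1 AND clin_sig == pathogenic.
# Step 1: join the indices (set intersection) to get the matching keys.
hits = by_gene.get("GENE1", set()) & by_clin_sig.get("pathogenic", set())
# Step 2: join the surviving keys back to the main table.
rows = {key: variants[key] for key in sorted(hits)}
```

In Hail terms, each index would presumably be a Table re-keyed by the criterion (for example with `Table.key_by`), and the surviving keys would be joined back to the main dataset (for example with `MatrixTable.semi_join_rows`). Whether that performs well at scale is exactly the open question in this thread.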

(Matrix)Tables aren’t indexed on samples and it isn’t efficient to pull out a single sample from a large dataset. This is getting requested increasingly often, and we will need a solution for it, although we don’t have a timeline yet.

I’d love it if we could make this concrete by getting some representative data and queries that we could start benchmarking.

I will put together some realistic use cases for queries on my end, and maybe some benchmarks to see how they come out! The benchmarking might take a couple of weeks given my current workload, but I will try to post the queries as soon as possible.