I wanted to give an update on the progress on this topic. I’ve built a full genomic querying engine on top of Hail that supports the following filtering methods:
Genomic Regions
You can pass in a list of regions you are interested in, each defined by:
- Chromosome
- Start Position
- End Position
Genomic Variants
You can pass in a list of variants you are interested in, each defined by:
- Chromosome
- Position
- Reference Allele Value
- Alternate Allele Value
These criteria let you filter variants by locus/allele values and by positional ranges. By itself this isn’t that helpful, so each criterion has additional properties that let you require some of the following attributes of the variants that meet the positional criteria above:
- Variant Class from NCBI dbSNP (synonymous, frameshift)
- Variant Type from NCBI dbSNP (single nucleotide variant)
- Clinical Significance from ClinVar (benign, likely benign)
- Allele Frequency from the dataset or gnomAD (various operations available)
- Zygosity (homozygous, heterozygous, or either)
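To make the shape of such a criterion concrete, here is a minimal plain-Python sketch of a region criterion with two optional attribute filters. This is not the engine's actual code and does not use Hail; the class names, field names, and the "maximum allele frequency" operation are all hypothetical, chosen only to illustrate positional matching combined with attribute matching.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Variant:
    # Hypothetical variant record; field names are illustrative only.
    chrom: str
    pos: int
    ref: str
    alt: str
    clinical_significance: Optional[str] = None
    allele_frequency: Optional[float] = None


@dataclass(frozen=True)
class RegionCriterion:
    # Positional part of the criterion (end position is inclusive here).
    chrom: str
    start: int
    end: int
    # Optional attribute filters; None means "do not filter on this attribute".
    clinical_significance: Optional[FrozenSet[str]] = None
    max_allele_frequency: Optional[float] = None

    def matches(self, v: Variant) -> bool:
        # Positional check first: chromosome and inclusive range.
        if v.chrom != self.chrom or not (self.start <= v.pos <= self.end):
            return False
        # Attribute checks apply only when they were specified.
        if (self.clinical_significance is not None
                and v.clinical_significance not in self.clinical_significance):
            return False
        if self.max_allele_frequency is not None:
            if (v.allele_frequency is None
                    or v.allele_frequency > self.max_allele_frequency):
                return False
        return True


v = Variant("1", 150, "A", "G", "benign", 0.01)
rare_benign = RegionCriterion("1", 100, 200,
                              clinical_significance=frozenset({"benign"}),
                              max_allele_frequency=0.05)
print(rare_benign.matches(v))  # True
```

A variant criterion would look the same, with an exact match on chromosome, position, reference allele, and alternate allele in place of the range check.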
After the variants are filtered using the positional and variant-attribute criteria above, we aggregate over each sample to check whether it carries any of the resulting variants with the zygosity specified for that variant/sample combination.
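The per-sample step can be sketched in plain Python as follows. This only illustrates the logic, assuming diploid genotypes stored as pairs of allele indices (0 = reference); the function names and the dictionary layout are hypothetical and are not how Hail stores a matrix table.

```python
def classify_zygosity(gt):
    """Return "homozygous", "heterozygous", or None for a diploid genotype.

    gt is a pair of allele indices (0 = reference). Missing and
    homozygous-reference genotypes return None, since the sample
    does not carry the variant.
    """
    if gt is None:
        return None
    a, b = gt
    if a == 0 and b == 0:
        return None
    return "homozygous" if a == b else "heterozygous"


def matching_samples(genotypes, wanted):
    """Collect samples carrying any wanted variant with the wanted zygosity.

    genotypes: {variant_key: {sample_id: (a, b)}} for variants that
               survived the positional and attribute filters.
    wanted:    {variant_key: "homozygous" | "heterozygous" | "either"}
    """
    hits = set()
    for key, zygosity in wanted.items():
        for sample, gt in genotypes.get(key, {}).items():
            observed = classify_zygosity(gt)
            if observed is not None and zygosity in ("either", observed):
                hits.add(sample)
    return hits


genotypes = {"1:150:A:G": {"s1": (0, 1), "s2": (1, 1), "s3": (0, 0)}}
print(sorted(matching_samples(genotypes, {"1:150:A:G": "either"})))  # ['s1', 's2']
```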
Working with the 1000 Genomes data, and scaling it up (by copying the genotype data) to 100k and 1 million samples, queries are running very fast: under 10 seconds in most cases.
Right now, the biggest hurdle we face is when one of our QA testers throws a very broad query at the engine, which results in us aggregating over billions of entries in the matrix table even though we have filtered out a lot of the rows.
Initially we expected hail.agg.any(...) to have a short-circuiting mechanism built in, so that when we filter out the columns (samples) that don’t carry any of the filtered variants, it would stop at the first match and not need to iterate over the remaining variants for that sample.
As it turns out, however, the aggregation methods have no short circuit functionality in place, and so I believe our biggest bottleneck lies in this aggregation.
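To show why this matters for broad queries, here is a small plain-Python comparison of the two behaviours for a single sample. It is only an illustration of the cost difference, not a description of how Hail implements its aggregators; both function names are hypothetical.

```python
def any_short_circuit(calls):
    """Stop at the first True; return (result, number of entries checked)."""
    checked = 0
    for is_hit in calls:
        checked += 1
        if is_hit:
            return True, checked
    return False, checked


def any_full_scan(calls):
    """Visit every entry regardless; return (result, number of entries checked)."""
    checked = 0
    found = False
    for is_hit in calls:
        checked += 1
        found = found or is_hit
    return found, checked


# One sample with 1,000 filtered variants, carrying the second one.
calls = [False, True] + [False] * 998
print(any_short_circuit(calls))  # (True, 2)
print(any_full_scan(calls))      # (True, 1000)
```

The result is the same either way, but the full scan touches every remaining entry for the sample, which is where the cost adds up once a broad query leaves many rows in play.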
For the most part, queries are small and focused on their targeted variants/regions and are inherently very fast, but this is something we want to get better at.
On top of all of this, you are able to build an arbitrarily nested tree structure with the criteria you specify, so you can express something like the following:
(Variant 1 OR Variant 2) AND Synonymous AND Benign
OR
NOT (Region 1 AND Multi-Nucleotide Variant)
That is a simple representation, but shows the kind of boolean logic that can be put into place.
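A tree like this can be represented and evaluated recursively. The sketch below is plain Python and assumes each leaf is a predicate over a single variant record; the node classes, the dictionary keys, and the concrete variants and region are hypothetical stand-ins, not the engine's real data model.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Leaf:
    test: Callable[[dict], bool]  # predicate over one variant record


@dataclass(frozen=True)
class Not:
    child: object


@dataclass(frozen=True)
class And:
    children: Tuple


@dataclass(frozen=True)
class Or:
    children: Tuple


def evaluate(node, variant):
    """Recursively evaluate a criteria tree against one variant record."""
    if isinstance(node, Leaf):
        return node.test(variant)
    if isinstance(node, Not):
        return not evaluate(node.child, variant)
    if isinstance(node, And):
        return all(evaluate(c, variant) for c in node.children)
    if isinstance(node, Or):
        return any(evaluate(c, variant) for c in node.children)
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Hypothetical leaves matching the example expression above.
variant_1 = Leaf(lambda v: v["key"] == "1:100:A:G")
variant_2 = Leaf(lambda v: v["key"] == "1:200:C:T")
synonymous = Leaf(lambda v: v["variant_class"] == "synonymous")
benign = Leaf(lambda v: v["clinical_significance"] == "benign")
region_1 = Leaf(lambda v: v["chrom"] == "1" and 50 <= v["pos"] <= 500)
mnv = Leaf(lambda v: v["variant_type"] == "MNV")

query = Or((
    And((Or((variant_1, variant_2)), synonymous, benign)),
    Not(And((region_1, mnv))),
))
```

Because each node only knows about its children, nesting depth is limited only by the recursion depth of the evaluator.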
Our user interface allows you to build these queries with ease and then submit them to the Hail cluster for processing. Once you get the results back, you can do a few things with them.
- You can export the genomic sample IDs
- Using a clinical cohort you have built with our other tools, you can filter the cohort down to people who have genomic data present, match up the clinical and genomic patient IDs of interest, and annotate them with feature flags for case/control analysis
- This case/control file is then uploaded to the cluster
- A GWAS can be executed with various parameters, returning output that can then be visualized and inspected in the browser
I have included a couple of example screenshots below: