Currently, we have a large infrastructure for genomic cohort building (finding which patients have variants that match criteria specified by the user). These systems are built on Oracle, Hive, and Elasticsearch, and they do the job fairly well. I wanted to start a discussion on whether it is feasible to replicate this level of cohorting within Hail, using its distributed approach with MatrixTables, and whether this is something worth pursuing.
At the moment, we allow the user to build an advanced query from a number of criteria, such as the following:
- Specific Genes/List of Genes
- Variant Class Types
- Types of Clinical Significance
- And several other types of annotated data
The user can build a complex query by combining these criteria with ANDs and ORs in any order, using parentheses to group them, to fit their cohort requirements.
I envision the flow would be to read the user's query and translate it into a command that Hail can understand, provided all the data needed to support such queries is present.
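As a rough illustration of that flow, here is a minimal sketch of the translation step in plain Python. Everything in it is an assumption made for the example: the tuple-based query tree, the field names (`gene`, `variant_class`, `clinical_significance`), and the helper names `translate` and `matches` are hypothetical and do not reflect our actual schema. The tree is folded into a single boolean with `&` and `|`. To my understanding Hail boolean expressions support the same two operators, so the leaf builder could in principle be swapped for one that returns Hail expressions, but that part is untested here.

```python
from functools import reduce
import operator

# Hypothetical query tree. A leaf is (field, allowed_values); a group is
# ("and" | "or", [children]). This encodes:
#   (gene IN {BRCA1, BRCA2} OR variant_class IN {frameshift})
#   AND clinical_significance IN {pathogenic, likely_pathogenic}
query = ("and", [
    ("or", [
        ("gene", {"BRCA1", "BRCA2"}),
        ("variant_class", {"frameshift"}),
    ]),
    ("clinical_significance", {"pathogenic", "likely_pathogenic"}),
])


def translate(node, leaf):
    """Recursively fold a query tree into one boolean expression.

    `leaf(field, values)` builds the predicate for a single criterion.
    Groups must be non-empty; they are combined with & (AND) and | (OR).
    """
    head, body = node
    if head in ("and", "or"):
        op = operator.and_ if head == "and" else operator.or_
        return reduce(op, (translate(child, leaf) for child in body))
    return leaf(head, body)


def matches(variant, query):
    # Plain-Python leaf: test one annotated variant (a dict) directly.
    # Assumption, not run here: with Hail the leaf might instead return
    # hl.literal(values).contains(mt[field]), and the folded expression
    # would be passed to mt.filter_rows(...).
    return translate(query, lambda field, values: variant[field] in values)


# Made-up records, one variant per patient, purely for demonstration.
variants = [
    {"patient": "P1", "gene": "BRCA1", "variant_class": "missense",
     "clinical_significance": "pathogenic"},
    {"patient": "P2", "gene": "TP53", "variant_class": "frameshift",
     "clinical_significance": "benign"},
    {"patient": "P3", "gene": "TP53", "variant_class": "frameshift",
     "clinical_significance": "likely_pathogenic"},
    {"patient": "P4", "gene": "TP53", "variant_class": "missense",
     "clinical_significance": "pathogenic"},
]

cohort = [v["patient"] for v in variants if matches(v, query)]
print(cohort)
```

Keeping the tree walk separate from the leaf builder means the AND/OR/grouping logic is written once, and only the per-criterion predicate would need to change between a local test and a distributed backend.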
Is this something Hail is able to do, and do efficiently?