Introducing query_genotypes; breaking changes to query methods

tpoterba · March 4, 2017, 6:19pm

New method: `query_genotypes`

We’ve introduced a new method, query_genotypes. If you’re familiar with query_samples and query_variants, it’s more of the same. But this method is actually pretty special: it exposes the SQL-like Hail query interface for every genotype in the dataset, as if you had one very large SQL table with variant, sample, annotation, and genotype columns. The difference is that Hail’s matrix-based engine executes this query far more efficiently than a traditional SQL database could.

Examples

>>> call_rate = vds.query_genotypes('gs.fraction(g => g.isCalled())')

Sure, you can already get the call rate from a dedicated method:

>>> call_rate = vds.count(genotypes=True)['callRate']

But Hail sure doesn’t have a dedicated function for computing a histogram of all GQ values in a dataset, and that is extremely easy to do with query_genotypes:

>>> import matplotlib.pyplot as plt
>>> [hist] = vds.query_genotypes(['gs.map(g => g.gq).hist(0, 100, 100)'])
>>> plt.xlim(0, 101)
>>> plt.ylim(0, 3000000)
>>> plt.xlabel('GQ Bin')
>>> plt.ylabel('Count')
>>> plt.scatter(hist.binEdges[1:], hist.binFrequencies)
>>> plt.show()
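
For anyone wondering what shape the `hist` result has, here is a rough pure-Python sketch of equal-width binning. It is not Hail's implementation. `binEdges` and `binFrequencies` are the fields used above; the `nLess` and `nGreater` field names, the right-closed last bin, and the sample GQ values are assumptions made for illustration.

```python
from collections import namedtuple

# Stand-in for the struct returned by hist(); nLess/nGreater are assumed names.
Histogram = namedtuple('Histogram', ['binEdges', 'binFrequencies', 'nLess', 'nGreater'])

def hist(values, start, end, bins):
    """Count values into `bins` equal-width bins spanning [start, end]."""
    width = (end - start) / float(bins)
    edges = [start + i * width for i in range(bins + 1)]
    freqs = [0] * bins
    n_less = n_greater = 0
    for x in values:
        if x < start:
            n_less += 1
        elif x > end:
            n_greater += 1
        else:
            # Assumption: the last bin is closed on the right, so x == end lands in it.
            idx = min(int((x - start) / width), bins - 1)
            freqs[idx] += 1
    return Histogram(edges, freqs, n_less, n_greater)

# Made-up GQ values, only to exercise the function.
gq_values = [0, 5, 20, 20, 99, 100, 150]
h = hist(gq_values, 0, 100, 10)
```

Note that there is one more bin edge than there are bins, which is why the plotting code above slices `hist.binEdges[1:]` before pairing it with `hist.binFrequencies`.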

No update is complete without breaking changes

You may have written expressions like this in the past, which were completely valid until today:

>>> vds.query_variants('variants.filter(v => v.altAllele.isSNP()).count()')

However, this will now cause an error:

TypeError: argument 'exprs' must be a list of str, but found <type 'str'>

All of the query methods (query_samples, query_variants, and the new query_genotypes) now require a list as the input parameter. This not only makes it clearer that the return type of these functions is a list, but also (we think) makes it clear that you should execute all your queries at once for efficient pipelines.
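
The efficiency point is that queries passed together can share one pass over the data instead of each triggering its own. A toy sketch of that idea in plain Python follows. It says nothing about how Hail's engine works internally; `run_queries`, the `(initial, step)` aggregator pairs, and the genotype records are all hypothetical.

```python
def run_queries(rows, aggregators):
    """Apply several (initial_state, step) aggregators in a single pass over `rows`."""
    states = [init for init, _ in aggregators]
    for row in rows:
        # Every aggregator sees the row before we move on, so the data is read once.
        states = [step(state, row) for state, (_, step) in zip(states, aggregators)]
    return states

# Hypothetical genotype records: (gq, is_called)
genotypes = [(30, True), (99, True), (0, False), (45, True)]

count_called = (0, lambda acc, g: acc + (1 if g[1] else 0))
max_gq = (0, lambda acc, g: max(acc, g[0]))

n_called, top_gq = run_queries(genotypes, [count_called, max_gq])
```

Calling `run_queries` twice with one aggregator each would give the same answers but iterate over `genotypes` twice, which is the cost the list form is meant to avoid on a large dataset.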

tpoterba · March 24, 2017, 3:50pm

We reverted these breaking changes – they were a bad idea.

Something like:

>>> result = vds.query_variants('variants.filter(v => v.altAllele.isSNP()).count()')

is totally valid syntax again.
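
Going by the examples in this thread, a single string gives back a single result and a list gives back a list. A minimal sketch of that dispatch pattern, with a hypothetical `query` helper and a dictionary standing in for the real evaluator (the expression names and numbers are made up):

```python
def query(exprs, evaluate):
    """Evaluate one expression or a list of them; the result mirrors the input shape."""
    if isinstance(exprs, str):
        return evaluate(exprs)
    return [evaluate(e) for e in exprs]

# Stand-in evaluator: a lookup table with invented results, not real Hail output.
fake_results = {'expr_a': 346, 'expr_b': 100}
evaluate = fake_results.get

single = query('expr_a', evaluate)                # one string in, one value out
[first, second] = query(['expr_a', 'expr_b'], evaluate)  # list in, list out
```

The list form is still the one to reach for when you have several queries, since they can then be submitted together.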
