Introducing query_genotypes; breaking changes to query methods

New method: query_genotypes

We’ve introduced a new method, query_genotypes. If you’re familiar with query_samples and query_variants, it works the same way. But this method is special: it exposes the SQL-like Hail query interface for every genotype in the dataset, as if you had one very large SQL table with variant, sample, annotation, and genotype columns. Hail’s matrix-based engine, however, executes this query far more efficiently than a traditional SQL database could.


>>> call_rate = vds.query_genotypes('gs.fraction(g => g.isCalled())')
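
To make the "very large SQL table" picture concrete, here is a minimal plain-Python sketch of what a fraction aggregation over genotypes computes. It does not use Hail: the row layout, the helper name fraction, and the use of None for a missing call are all assumptions made for illustration.

```python
# Illustrative only: plain Python, not Hail.
# Each tuple is one row of the conceptual (variant, sample, genotype) table.
# None stands in for a genotype that was not called (an assumption).
rows = [
    ("1:1000:A:T", "s1", 0),
    ("1:1000:A:T", "s2", 1),
    ("1:1000:A:T", "s3", None),
    ("1:2000:G:C", "s1", 2),
    ("1:2000:G:C", "s2", None),
    ("1:2000:G:C", "s3", 0),
]


def fraction(items, predicate):
    """Return the share of items for which predicate(item) is true."""
    items = list(items)
    return sum(1 for x in items if predicate(x)) / len(items)


# Analogue of gs.fraction(g => g.isCalled()): every genotype in every row
# contributes to one dataset-wide number.
call_rate = fraction((gt for _, _, gt in rows), lambda gt: gt is not None)
print(call_rate)
```

The point of the sketch is only the shape of the computation: one aggregation over all variant-sample pairs, with no grouping by variant or by sample.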

Sure, you can also get call rate from a dedicated method:

>>> call_rate = vds.count(genotypes=True)['callRate']

But Hail has no dedicated function for computing a histogram of all GQ values in a dataset, and that is easy to do with query_genotypes:

>>> import matplotlib.pyplot as plt
>>> [hist] = vds.query_genotypes(['gs.map(g => g.gq).hist(0, 100, 100)'])
>>> plt.xlim(0, 101)
>>> plt.ylim(0, 3000000)
>>> plt.xlabel('GQ Bin')
>>> plt.ylabel('Count')
>>> plt.scatter(hist.binEdges[1:], hist.binFrequencies)
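
For intuition about the object being plotted, the sketch below bins values in plain Python. It is not Hail's implementation: the function name hist, the Hist tuple, and the convention that bins are left-inclusive with the last bin also including the upper bound are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical result type mirroring the two fields used in the plot above.
Hist = namedtuple("Hist", ["binEdges", "binFrequencies"])


def hist(values, start, end, bins):
    """Count values into `bins` equal-width bins spanning [start, end]."""
    width = (end - start) / bins
    edges = [start + i * width for i in range(bins + 1)]
    freqs = [0] * bins
    for v in values:
        # Skip missing values and anything outside the requested range.
        if v is None or v < start or v > end:
            continue
        # Clamp so that v == end falls into the last bin (assumed convention).
        idx = min(int((v - start) / width), bins - 1)
        freqs[idx] += 1
    return Hist(edges, freqs)


gq_values = [0, 5, 20, 20, 99, 100, None]
h = hist(gq_values, 0, 100, 100)
print(len(h.binEdges), len(h.binFrequencies))
```

With 100 bins there are 101 edges, which is consistent with the plot above pairing binEdges[1:] with binFrequencies.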

No update is complete without breaking changes

You may have written expressions like this in the past, which were completely valid until today:

>>> vds.query_variants('variants.filter(v => v.altAllele.isSNP()).count()')

However, this will now cause an error:

TypeError: argument 'exprs' must be a list of str, but found <type 'str'>

All of the query methods (query_samples, query_variants, and the new query_genotypes) now require a list of strings as the input parameter. This not only makes it clearer that these functions return a list, but also (we think) signals that you should execute all your queries at once for efficient pipelines.

Update: we reverted these breaking changes. They were a bad idea.

Something like:

>>> result = vds.query_variants('variants.filter(v => v.altAllele.isSNP()).count()')

is totally valid syntax again.
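
One way an API can accept a single expression while still supporting batches is to normalize the argument and mirror its shape in the return value. The sketch below is hypothetical and is not Hail's source: run_queries, evaluate, and fake_evaluate are illustrative names, the evaluator is a stand-in, and it assumes that a list of expressions remains accepted alongside a single string.

```python
def run_queries(exprs, evaluate):
    """Accept one expression or a list; return a result of matching shape.

    `evaluate` is a hypothetical callable that takes a list of expression
    strings and returns a list of results, one per expression.
    """
    if isinstance(exprs, str):
        # Single string in, single result out.
        return evaluate([exprs])[0]
    if isinstance(exprs, (list, tuple)) and all(isinstance(e, str) for e in exprs):
        # List in, list out; the whole batch reaches the evaluator together.
        return evaluate(list(exprs))
    raise TypeError(
        "argument 'exprs' must be a str or a list of str, but found %r" % type(exprs)
    )


def fake_evaluate(expr_list):
    """Stand-in evaluator: returns the length of each expression string."""
    return [len(e) for e in expr_list]


single = run_queries("abc", fake_evaluate)
many = run_queries(["a", "bb"], fake_evaluate)
print(single, many)
```

Passing the whole list to the evaluator in one call keeps the option of running every query together, which was the motivation given above for the list-only interface.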