New method: query_genotypes
We’ve introduced a new method, query_genotypes. If you’re familiar with query_samples and query_variants, it’s more of the same. But this method is actually pretty special: it exposes the SQL-like Hail query interface for every genotype in the dataset, as if you had a very large SQL table with variant, sample, annotation, and genotype columns. The difference is that Hail’s matrix-based engine executes this query far more efficiently than traditional SQL databases could.
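To make the "very large table" mental model concrete, here is a minimal pure-Python sketch. It is not how Hail stores or executes anything: the rows, the sample names, and the fraction helper are all made up for illustration, and the helper only mimics the idea of an aggregation over every genotype.

```python
# Toy model only: one row per (variant, sample, genotype) entry.
# None stands in for a missing (uncalled) genotype.
rows = [
    ("1:1000:A:T", "sample1", "0/1"),
    ("1:1000:A:T", "sample2", None),
    ("1:2000:G:C", "sample1", "0/0"),
    ("1:2000:G:C", "sample2", "1/1"),
]


def fraction(items, predicate):
    """Hypothetical helper: share of items for which predicate is true."""
    items = list(items)
    return sum(1 for item in items if predicate(item)) / float(len(items))


# Conceptually similar to aggregating over all genotypes at once.
call_rate = fraction(rows, lambda row: row[2] is not None)
print(call_rate)  # 0.75
```

Three of the four toy genotypes are called, so the call rate is 0.75.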
Examples
>>> [call_rate] = vds.query_genotypes(['gs.fraction(g => g.isCalled())'])
Sure, you can get call rate using specific functionality:
>>> call_rate = vds.count(genotypes=True)['callRate']
But Hail sure doesn’t have a specific function for computing a histogram of all GQ values in a dataset, and this is extremely easy to do with query_genotypes.
>>> import matplotlib.pyplot as plt
>>> [hist] = vds.query_genotypes(['gs.map(g => g.gq).hist(0, 100, 100)'])
>>> plt.xlim(0, 101)
>>> plt.ylim(0, 3000000)
>>> plt.xlabel('GQ Bin')
>>> plt.ylabel('Count')
>>> plt.scatter(hist.binEdges[1:], hist.binFrequencies)
>>> plt.show()
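The plot uses hist.binEdges[1:] because a histogram with N bins has N + 1 edges but only N frequencies. The sketch below is a toy re-implementation in plain Python, not Hail's code. The toy_hist function is hypothetical, and its binning rules (left-inclusive bins, a last bin that also includes the upper bound, out-of-range values ignored) are assumptions made for illustration.

```python
def toy_hist(values, start, end, bins):
    """Hypothetical stand-in for a (start, end, bins) histogram aggregation."""
    width = (end - start) / float(bins)
    edges = [start + i * width for i in range(bins + 1)]
    freqs = [0] * bins
    for x in values:
        if x < start or x > end:
            continue  # this toy simply skips out-of-range values
        # Clamp so that x == end falls into the last bin.
        idx = min(int((x - start) / width), bins - 1)
        freqs[idx] += 1
    return edges, freqs


edges, freqs = toy_hist([0, 20, 20, 99, 100], 0, 100, 100)

# 101 edges but 100 counts: dropping the first edge pairs each count
# with the upper edge of its bin.
print(len(edges), len(freqs), len(edges[1:]))
```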
No update is complete without breaking changes
You may have written expressions like this in the past, which were completely valid until today:
>>> vds.query_variants('variants.filter(v => v.altAllele.isSNP()).count()')
However, this will now cause an error:
TypeError: argument 'exprs' must be a list of str, but found <type 'str'>
All of the query methods (query_samples, query_variants, and the new query_genotypes) now require a list as the input parameter. This not only makes it clearer that the return type of these functions is a list, but also (we think) makes it clear that you should execute all your queries at once for efficient pipelines.
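For existing scripts, the fix is to wrap each expression in a list and unpack the list that comes back. Here is a minimal sketch of that migration. The as_expr_list helper is hypothetical and not part of Hail, and the Hail calls appear only in comments because they need a loaded dataset.

```python
def as_expr_list(exprs):
    """Hypothetical helper (not part of Hail): wrap a lone expression in a list."""
    if isinstance(exprs, str):
        return [exprs]
    return list(exprs)


old_style = 'variants.filter(v => v.altAllele.isSNP()).count()'
new_style = as_expr_list(old_style)
print(new_style)

# With a loaded dataset `vds`, the migrated call would look like:
#   [n_snps] = vds.query_variants(new_style)
#
# Several expressions can go into one call, so they share a single pass:
#   [n_snps, n_variants] = vds.query_variants([
#       'variants.filter(v => v.altAllele.isSNP()).count()',
#       'variants.count()',
#   ])
```

Unpacking with [n_snps] = ... on the left-hand side mirrors the [hist] = ... pattern used in the histogram example.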