We’ve introduced a new method, query_genotypes. If you’re familiar with query_variants, it’s more of the same. But this method is actually pretty special: it exposes the SQL-like Hail query interface for every genotype in the dataset, as if you had a very large SQL table with variant, sample, annotation, and genotype columns. Hail’s matrix-based engine, though, executes this query far more efficiently than a traditional SQL database could.
>>> [call_rate] = vds.query_genotypes(['gs.fraction(g => g.isCalled())'])
Sure, you can get call rate using specific functionality:
>>> call_rate = vds.count(genotypes=True)['callRate']
But Hail sure doesn’t have a specific function for computing a histogram of all GQ values in a dataset, and this is extremely easy to do with query_genotypes.
>>> import matplotlib.pyplot as plt
>>> [hist] = vds.query_genotypes(['gs.map(g => g.gq).hist(0, 100, 100)'])
>>> plt.xlim(0, 101)
>>> plt.ylim(0, 3000000)
>>> plt.xlabel('GQ Bin')
>>> plt.ylabel('Count')
>>> plt.scatter(hist.binEdges[1:], hist.binFrequencies)
>>> plt.show()
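The plot above pairs hist.binEdges[1:] with hist.binFrequencies, which suggests there is one more edge than there are bins. The following is a minimal pure-Python sketch of an equal-width histogram that shows why the slice lines up. It is not Hail's implementation: toy_hist, the sample GQ values, and the choice to count the top edge in the last bin are all assumptions made for illustration.

```python
def toy_hist(values, start, end, bins):
    """Equal-width histogram over [start, end]; a sketch, not Hail's code."""
    width = float(end - start) / bins
    # bins + 1 edges delimit `bins` intervals.
    bin_edges = [start + i * width for i in range(bins + 1)]
    bin_frequencies = [0] * bins
    for v in values:
        if v < start or v > end:
            continue  # out-of-range values are simply skipped in this sketch
        # Assumption: a value equal to `end` is counted in the last bin.
        idx = min(int((v - start) / width), bins - 1)
        bin_frequencies[idx] += 1
    return bin_edges, bin_frequencies


# Made-up GQ values, only for demonstration.
gq_values = [0, 5, 20, 20, 45, 99, 99, 100]
edges, freqs = toy_hist(gq_values, 0, 100, 4)

# Dropping the first edge leaves one (upper) edge per bin.
for upper_edge, count in zip(edges[1:], freqs):
    print(upper_edge, count)
```

With one edge dropped, each remaining edge is the upper bound of its bin, so the two sequences have the same length and can be passed straight to a scatter plot.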
No update is complete without breaking changes
You may have written expressions like this in the past, which were completely valid until today:
>>> vds.query_variants('variants.filter(v => v.altAllele.isSNP()).count()')
However, this will now cause an error:
TypeError: argument 'exprs' must be a list of str, but found <type 'str'>
All of the query methods (query_variants and the new query_genotypes) now require a list as the input parameter. This not only makes it clearer that the return type of these functions is a list, but also (we think) makes it clear that you should execute all your queries at once for efficient pipelines.
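To illustrate the list-in, list-out contract and why batching queries can pay off, here is a toy stand-in written in plain Python. It is not Hail's engine or API: run_queries, the (initial state, update, finish) aggregator triples, and the toy genotype records are all hypothetical. The point is only that several aggregations can share a single scan of the data, and that the results come back as a list in the same order as the queries.

```python
def run_queries(records, aggregators):
    """Evaluate every aggregator in one pass; return results as a list."""
    if not isinstance(aggregators, list):
        raise TypeError("argument must be a list, but found %r" % type(aggregators))
    states = [init for init, _, _ in aggregators]
    for rec in records:  # the data is scanned exactly once
        states = [update(state, rec)
                  for state, (_, update, _) in zip(states, aggregators)]
    return [finish(state)
            for state, (_, _, finish) in zip(states, aggregators)]


# Toy genotypes: in this sketch, gq is None when the genotype is not called.
genotypes = [{"gq": 99}, {"gq": None}, {"gq": 40}, {"gq": 12}]

# Fraction of called genotypes: state is (called, total).
fraction_called = (
    (0, 0),
    lambda s, g: (s[0] + (g["gq"] is not None), s[1] + 1),
    lambda s: float(s[0]) / s[1],
)

# Largest GQ seen among called genotypes.
max_gq = (
    None,
    lambda s, g: g["gq"] if g["gq"] is not None and (s is None or g["gq"] > s) else s,
    lambda s: s,
)

# One call, one pass, two results unpacked from the returned list.
[call_rate, top_gq] = run_queries(genotypes, [fraction_called, max_gq])
print(call_rate, top_gq)
```

Unpacking with [call_rate, top_gq] = ... mirrors the [hist] = ... pattern used earlier, and works the same way for a single-element list.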