We are thrilled to announce the formal release of Hail 0.2! The interface is now stable.
Hail 0.2 reflects over a year of work based on the experiences of our team and our users with Hail 0.1. It’s a huge step forward in terms of generality, flexibility and power.
Here are some of the major changes:
- The Python interface was completely redesigned. It is now 100% pure Python, and the “Hail expression language” is now gone. Where possible, the Hail types and functions have been made to match the corresponding interfaces in Python. For example, the length of a Hail array expression `a` is now `hl.len(a)`.
- Hail 0.1’s `KeyTable` and `VariantDataset` are now called `Table` and `MatrixTable`, respectively. In 0.1, the `VariantDataset` class had a wide variety of methods like `sample_qc` and `variant_qc`. This functionality is preserved, but is now invoked as `hl.sample_qc(mt)` and `hl.variant_qc(mt)`.
- While a `VariantDataset` was keyed by variants and samples, `MatrixTable`s are completely generic in terms of data schema.
- In particular, `MatrixTable`s can be grouped by rows or columns with entries aggregated to form new `MatrixTable`s. For example, in Hail 0.2, a gene burden test is done by simply grouping variants by gene and then applying a regression function.
- The `Variant` and `Genotype` data types have been removed. Instead of `Variant`, data imported from VCF / BGEN / GEN will be keyed by a `locus` field (of type `locus`) and an `alleles` field (of type `array<str>`). Instead of `Genotype`, each of these formats will import entry fields appropriate for the input data: all VCF format fields, `GT`/`dosage`/`GP` for BGEN, etc.
- Hail 0.2 supports reference genomes, which are tracked as part of the Hail `locus` type.
- Aggregator functionality is greatly expanded, including support for grouped aggregations and multivariate, weighted linear regression.
- We added scans, which are running aggregations. These can be used to, for instance, compute a running sum of a table field.
- We added dense and block-sparse BlockMatrices that interoperate with NumPy matrices. These can be used to, for instance, compute genome-wide banded linkage disequilibrium.
- We added a scalable linear mixed model.
- We added Poisson regression.
- We added an experimental plotting library which can handle large datasets by intelligent downsampling.
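To make two of the ideas above concrete (grouping variants by gene and aggregating their entries, and scans as running aggregations), here is a minimal sketch in plain Python. It is an analogy only and does not use Hail's API: the toy data, `burden_by_gene`, and `running_sum` are hypothetical names invented for illustration. The running sum here includes the current element; this post does not say whether Hail's scans do the same, so treat that as an assumption of the sketch.

```python
from itertools import accumulate

# Hypothetical toy data: (variant, gene, alt-allele count per sample).
# In Hail these would be row fields and entry fields of a MatrixTable.
variants = [
    ("1:100:A:T", "GENE_A", [0, 1, 2]),
    ("1:200:G:C", "GENE_A", [1, 0, 0]),
    ("2:300:T:G", "GENE_B", [0, 0, 1]),
]


def burden_by_gene(rows):
    """Group rows by gene and sum the per-sample counts within each group.

    The result has one row per gene and one value per sample, which is the
    shape a burden-style regression would then consume.
    """
    burden = {}
    for _variant, gene, counts in rows:
        if gene not in burden:
            burden[gene] = [0] * len(counts)
        burden[gene] = [total + c for total, c in zip(burden[gene], counts)]
    return burden


def running_sum(values):
    """Running (inclusive) sum: element i is the sum of values[0..i]."""
    return list(accumulate(values))


if __name__ == "__main__":
    print(burden_by_gene(variants))
    print(running_sum([3, 1, 4, 1]))
```

The grouped result is itself a variants-by-samples structure with genes in place of variants, which is why the same downstream tools (such as a regression) can be applied to it unchanged.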
Does this mean Hail is finished? No! Here are a few of the exciting things we have planned:
- We’re building a tool to losslessly import and merge gVCFs that scales linearly with samples and supports incremental sample addition. This will be essential for the massive datasets that are coming down the pipe. Here’s a presentation on the prototype at GA4GH 2018.
- Hail 0.2 already includes a simple query optimizer, but performance should improve greatly as we improve it. We’re also prototyping a new C++ code generator that has shown >3x improvement on simple pipelines.
- We’re planning a multi-tenant always-on Hail service. No more spinning up clusters: instant analytics!
- We plan to greatly expand our linear algebra functionality, adding both local and distributed n-dimensional arrays, integrated with the query optimizer. These primitives will in particular support machine learning algorithms for scalable analyses of (single-cell) RNAseq data.
- We’re working on fast, approximate methods to summarize distributions, e.g. quantiles.
- As always, your feedback will factor into development! If you have ideas or requests, we’d love to see them posted to our forum.
n-dimensional (Tensor)Tables will have to wait for 0.3.
Hail 0.2 caveats:
- Unfortunately, Hail 0.2 has a new file format and cannot read Hail 0.1 `VariantDataset` and `KeyTable` files.
- Pipelines from Hail 0.1 will likewise need to be rewritten for 0.2.
- If a piece of functionality is marked as experimental, we reserve the right to modify or remove that functionality during the life of the 0.2 stable release. The plotting library is an example of experimental functionality.
Whether you’re familiar with Hail 0.1 or completely new to Hail, we recommend going through the Hail 0.2 tutorials to learn the new interface: https://hail.is/docs/devel/tutorials-landing.html. And, of course, if you have questions you can find us on the user forum (http://discuss.hail.is) or on Hail Zulip chat (http://hail.zulipchat.com).