Big picture issues: considering switching to HAIL

I think this is a good discussion to have, and am happy to go in-depth here. Many of these questions have come up before, and I imagine that we’ll want to take some of this material and copy it to a FAQ section on the website.

I’ll try to address each area of questions sequentially.

Hail maintenance and versioning

You’re not the first to be concerned about using software designated with an initial development (0.X) version. Despite the “semantic” in semantic versioning, there is some heterogeneity in what people are trying to communicate using versioning schemes. For instance, the de facto Python library for tabular data analysis is pandas, which is on its 0.24 release – still not 1.0. With respect to Hail:

  • Interfaces are stable within the major (0.X) versions.
  • We have had two major versions so far (0.1 and 0.2). We don’t anticipate 0.3 being released in the next 6-9 months, because it’s possible to develop the backend without changing the 0.2 interfaces for a while.
  • Every commit (30-60 per week) is subject to code review and must pass all automated tests.
  • Bugs are fixed aggressively, and a list of fixed bugs is found in the change log on the website.

So why isn’t Hail at 1.0? We don’t view Hail as “buggy”, but we do view it as “partial”. Our vision for a 1.0 release is far broader than 0.1 or 0.2. Hail 0.1 was a library well-suited to processing high-throughput sequencing VCF-like data. 0.2 is more general-purpose, useful for processing 2-dimensional structured data, with a special focus on genomics. But the 1.0 we’re imagining is a library that exposes all the primitives one would need to do scientific analysis at scale, with an emphasis on genomics – we need the ability to represent and compute on arbitrary structured or numerical tensors, including support for distributed linear algebra, which is currently in the early stages of development.

To address your specific questions:

Will Hail be supported for the next 5 years? … for how long will Hail be supported?

We can’t commit to any specific window of time without guaranteed funding, but our team has doubled in the last two years (currently ten full-time engineers) and is looking to grow further. Anything can happen, but I find it difficult to imagine that Hail would not have maintainers for the next 5 years, and we’re looking to do much more than just maintain Hail in the coming years!

Is v0.2 trustworthy for publication work?

The answer is a resounding yes. I think a great deal of scientific software is labeled as v1.0 without having sufficient development/test infrastructure to merit a “stable general release” – nearly every time we’ve tried to write tests that compare the results of a Hail reimplementation of some common algorithm to other widely-used tools, we’ve found correctness bugs in the other tools. PLINK is an exception.

Will a reviewer give us a lot of grief due to v0.2?

Since this has happened before (and somewhat recently) to one of our Broad collaborators, the answer is definitely “possibly yes”. This is a good reason for having a public explication of our thoughts on versioning somewhere on the website, so that there is a document to point to when these questions come up.

When will the beta-test and v1 be released?

I don’t have a good estimate for that right now. I do think, however, that the release after 0.2 is as likely to be 1.0 as 0.3. That will probably depend on how things are going in about a year, and what the linear algebra system looks like at that point. 0.2 remains the stable, actively developed release for the time being, though.

How intensive is the support and feature development?

Support

We do most of our support either through this forum (my preference because posts are very easy to find on search engines like Google) or through our Zulip chatroom. I don’t think there have been complaints about lack of support, whether from Broad collaborators, external academic users, or industry users. We do plan to build a dedicated support organization within the team, but don’t have a timeline for that.

Feature development

As above, we have a large group committing dozens of code changes per week to the project. The vast majority of these changes are to underlying infrastructure, building toward two major features:

  • Improved performance. If we want any hope of analyzing the 1M-WGS datasets that will be here in a couple of years, we need to increase Hail’s performance by at least an order of magnitude. However, we’re not hard-coding specific algorithms that are easy to optimize in isolation – we need to build a super fast general-purpose data-processing system, and that’s much harder.
  • Distributed linear algebra. This will unlock a huge body of applications, both in genomics and in other domains of biology (single-cell RNA-seq, for instance). When this is done, methods development for things like PC-Relate, which is currently thousands of lines of R and C code, can be ~5 lines of linear algebra in Hail (sketched below) – and it will benefit from Hail’s scalability for free.
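
To give a flavor of what that enables, here’s a minimal sketch using the BlockMatrix support that already exists in 0.2. The dataset path is hypothetical, missing genotypes are zeroed only to keep the example short, and the kinship-style product is illustrative rather than the actual PC-Relate implementation:

    import hail as hl
    from hail.linalg import BlockMatrix

    mt = hl.read_matrix_table('cohort.mt')  # hypothetical dataset

    # Distributed genotype matrix (variants x samples); missing calls are
    # zeroed here purely to keep the sketch short.
    G = BlockMatrix.from_entry_expr(hl.or_else(mt.GT.n_alt_alleles(), 0))

    # A sample-by-sample kinship-style matrix in one line of linear algebra.
    K = (G.T @ G) / mt.count_rows()

    K.write('kinship.bm', overwrite=True)  # hypothetical output path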

UKBB / GWAS / WES analysis examples

This is a tough one, since the development team doesn’t really have complete insight into everything people are doing with Hail. We’d love it if people shared all their code, but this seems not to happen – a loss for reproducibility. I’ll reach out to some of the people who I know are doing this kind of analysis in the coming weeks, and see if there’s a way we can get their code on GitHub with a link from the Hail docs.

What Hail can’t do

Common methods not in Hail

This is definitely an incomplete list (I may edit the post to add more later), but the most important missing features in Hail right now are:

  • Mixed models. We do have a linear mixed model, but it doesn’t scale as well as we’d hoped, so things like BOLT-LMM will beat Hail. We don’t have any attempt at a logistic mixed model, and so don’t have a way to compare with SAIGE/SAIGE-GENE right now.
  • Imputation. I’ll address this further below.
  • Haplotype methods. There’s really nothing in Hail right now for haplotype-based analyses, but with good motivating examples, we could certainly add features here.
  • Non-trio pedigrees. Hail has great support for dealing with trio data, but nothing built in to deal with larger pedigrees.

Loading a large and growing cohort into Hail

This is a good opportunity to mention that we’re not just taking other methods and reimplementing them in ways that scale – we’re also doing research to build entirely new formats, algorithms, and methods. In the linked post I reference work being done by Chris Vittal on the team to build new ways to represent and compute on large cohorts of sequenced samples. I can’t find any materials from the talk he gave at GA4GH in Basel last year, but here’s a short talk he gave as part of a Stanley Center primer on Hail last month.

There’s no other software I know of that addresses this problem acceptably, so tolerating tools in development is probably the only option! Chris is currently using Hail to transform about 150K WGS gVCFs into the format described in the video.

Imputation

Several members of Ben’s group formed an imputation working group about a year ago to think about this very problem, and concluded that the latest versions of IMPUTE and Beagle are amazing methods and that there’s no reason to try to build a competing algorithm. However, we want to make it possible to run these external tools without having to entirely leave the Hail ecosystem, and Jackie is building a pipeline execution framework in Hail that should make this process easy. It’s still early days for pipeline, though, and it’ll be a while before that’s stable.

I think that Hail can do pre-impute QC, write VCFs for upload to Sanger or U Michigan impute servers, then import imputed data for post processing and final analyses.

Yeah, this should definitely work right now. I’m not sure anybody locally has experience with this specific workflow, but I don’t see any reason why it would be difficult to build inside Hail at the moment.
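
For what it’s worth, a minimal sketch of that workflow in the 0.2 Python interface might look like the following. The file paths and QC thresholds are placeholders, not recommendations:

    import hail as hl

    hl.init()

    # Pre-imputation QC on array genotype data
    mt = hl.import_vcf('genotypes.vcf.bgz', reference_genome='GRCh37')
    mt = hl.variant_qc(mt)
    mt = mt.filter_rows(
        (mt.variant_qc.call_rate > 0.97) &
        (mt.variant_qc.AF[1] > 0.01) &
        (mt.variant_qc.p_value_hwe > 1e-6))

    # Export a VCF for upload to an external imputation server
    hl.export_vcf(mt, 'for_imputation.vcf.bgz')

    # ... run imputation on the Sanger or Michigan server, download results ...

    # Re-import the imputed data for post-processing and final analyses
    imputed = hl.import_vcf('imputed.vcf.bgz', reference_genome='GRCh37')
    imputed = hl.variant_qc(imputed)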

Risk vs benefit and conclusions

My personal thoughts:

If you’ve got sequencing data, then there’s really no better platform than Hail, even for small data.

For genotype data, I’m not convinced Hail should be part of the pipeline, especially if you want to run mixed models. In general, the more an analysis deviates from well-defined command-line modules, the more reason to use Hail – it’s a library that makes it quite easy to explore and understand genetic data, and to figure out the right next steps for analysis. I do think, however, that some of our infrastructural improvements in the next 6-12 months will make the experience of running on genotype data much nicer.
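
To illustrate that exploratory style, a basic association scan in the 0.2 Python interface is only a few lines. The dataset path and the phenotype/covariate field names here are hypothetical:

    import hail as hl

    mt = hl.read_matrix_table('cohort.mt')  # hypothetical dataset
    mt = hl.variant_qc(mt)
    mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01)  # keep common variants

    # Simple linear-regression GWAS; 'pheno' and 'age' are hypothetical
    # sample (column) fields on the dataset.
    results = hl.linear_regression_rows(
        y=mt.pheno,
        x=mt.GT.n_alt_alleles(),
        covariates=[1.0, mt.age])

    results.order_by(results.p_value).show()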

One final point: we’d estimate that there are a few hundred active Hail users globally right now, and we do listen to that community – a benefit of being a noisy, early-ish adopter is that your own group’s interests will be that much better represented in the development pipeline!
