Big picture issues: considering switching to HAIL

Hi.

I lead genomic consortia that need to scale for larger Ns, all types of variation, and to escape limits of brick-and-mortar clusters. We’ve been following HAIL for 2+ years, certainly see its utility, and are super impressed with what we’ve seen and with the work from (eg) UKBB, SCHEMA, etc. From talking w Ben and Mark, I’m aware that multiple large projects are now entirely HAIL-based.

Switching over to HAIL seems to be more a matter of “when” than “whether”. We are debating if now is the time. I am hoping that I can get some big picture insights to help in this decision. Shifting from current pipelines to HAIL is a big deal: we have pipelines that work but we’d have to retrain our analyst staff, redo existing QC, etc. I am hoping to get more info to assist with our deliberations.

First, HAIL is currently v0.2 with a copyright date of 2016 (https://hail.is/docs/0.2/index.html). The impression is thus that this might be a chancy switch (versions under 0.9 are often buggy or partial). From discussions with colleagues, I gather that this is an incorrect view, that the view is clear internal to Broad but not clear to an external user. For instance, will HAIL be supported for the next 5 years? Is v0.2 trustworthy for publication work? Will a reviewer give us a lot of grief due to v0.2? When will the beta-test and v1 be released? How intensive is the support and feature development? Crucially - for how long will HAIL be supported?

Second, have looked at the help page and the great UKBB blogs. UKBB is an atypical example, and I know there are examples of analysis of smallish datasets. This is great to illustrate features. It would be really helpful to see a library of real-world, start-to-finish analyses - notebooks for GWAS and WES analysis of real data from soup-to-nuts. As you state, “Hail’s How-To Guides are in their early stages. We welcome suggestions for additional guides, as well as feedback about our documentation”, so here’s a suggestion!

Third, need to know what HAIL cannot now do. For example, a basic function important for a consortium is under development, see Loading a large and growing cohort into Hail.

Fourth, related, is imputation to 1000G, HRC, etc (search of HAIL.is only found “impute_sex” and the like). Ben told me that this was a current limitation. Imputation is obv fundamental to consortia to align different data from different SNP arrays. I think that HAIL can do pre-impute QC, write VCFs for upload to Sanger or U Michigan impute servers, then import imputed data for post processing and final analyses. Is this the path? Is there much experience with this?

Just to be sure that this point is not lost (and integrating conversations with Ben, Brendan, and others), this is an extremely impressive effort that is of immense benefit to genomics - wonderful to see people identify a need and take comprehensive, highly informed, and robust efforts to fill a critical void.

My question is really about risk : benefit. Is it the time for me and mine to make the jump? Any help anyone might offer would be greatly appreciated. Best - PF Sullivan

I think this is a good discussion to have, and am happy to go in-depth here. Many of these questions have come up before, and I imagine that we’ll want to take some of this material and copy it to a FAQ section on the website.

I’ll try to address each area of questions sequentially.

Hail maintenance and versioning

You’re not the first to be concerned about using software designated with an initial development (0.X) version. Despite the “semantic” in semantic versioning, there is some heterogeneity in what people are trying to communicate using versioning schemes. For instance, the de facto Python library for tabular data analysis is pandas, which is on its 0.24 release – still not 1.0. With respect to Hail:

  • interfaces are stable within the major (0.X) versions.
  • we have had 2 major versions so far (0.1 and 0.2). We don’t anticipate 0.3 being released in the next 6-9 months, because it’s possible to develop the backend without changing the 0.2 interfaces for a while.
  • Every commit (30-60 per week) is subject to code review and must pass all automated tests.
  • Bugs are fixed aggressively, and a list of fixed bugs is found in the change log on the website.

So why isn’t Hail in 1.0? We don’t view Hail as “buggy”, but we do view it as “partial”. Our vision for a 1.0 release is far broader than 0.1 or 0.2. Hail 0.1 was a library that was well-suited for processing high-throughput sequencing VCF-like data. 0.2 is more general-purpose, useful for processing 2-dimensional structured data, with special focus on genomics. But the 1.0 we’re imagining is a library that exposes all the primitives one would need to do scientific analysis at scale, with emphasis on genomics – we need the ability to represent and compute on arbitrary structured or numerical tensors, including support for distributed linear algebra, which is currently in early stages of development.

To address your specific questions:

Will Hail be supported for the next 5 years? … for how long will Hail be supported?

We can’t commit to any specific window of time without guaranteed funding, but our team has doubled in the last two years (currently ten full-time engineers), and is looking to grow further. Anything can happen, but I find it difficult to imagine that Hail would not have maintainers for the next 5 years, and we’re looking to do much more than just maintain Hail in the coming years!

Is v0.2 trustworthy for publication work?

The answer is a resounding yes. I think a great deal of scientific software is labeled as v1.0 without having sufficient development/test infrastructure to merit a “stable general release” – nearly every time we’ve tried to write tests that compare the results of a Hail reimplementation of some common algorithm to other widely-used tools, we’ve found correctness bugs in the other tools. PLINK is an exception.

Will a reviewer give us a lot of grief due to v0.2?

Since this has happened before (and somewhat recently) to one of our Broad collaborators, the answer is definitely “possibly yes”. This is a good reason for having a public explication of our thoughts on versioning somewhere on the website, so that there is a document to point to when these questions come up.

When will the beta-test and v1 be released?

I don’t have a good estimate for that right now. I do think, however, that the release after 0.2 is as likely to be 1.0 as 0.3. This probably depends on how things are going in about a year, and what the linear algebra system looks like at that point. 0.2 remains the stable release in development for the time being, though.

How intensive is the support and feature development?

Support

We do most of our support either through this forum (my preference because posts are very easy to find on search engines like Google) or through our Zulip chatroom. I don’t think there have been complaints about lack of support, whether from Broad collaborators, external academic users, or industry users. We do plan to build a dedicated support organization within the team, but don’t have a timeline for that.

Feature development

As above, we have a large group committing dozens of code changes per week to the project. A vast majority of these changes are to underlying infrastructure, which are building toward two major features:

  • improved performance. If we want any hope of analyzing the 1M WGS datasets that will be here in a couple of years, we need to increase Hail’s performance by at least an order of magnitude. However, we’re not hard-coding some specific algorithms that are easy to optimize independently – we need to build a super fast general-purpose data-processing system, and that’s much harder.
  • distributed linear algebra. This will unlock a huge body of applications, both to genomics and other domains in biology (single-cell RNA-seq, for instance). When this is done, methods development for things like PC-Relate, which is currently thousands of lines of R and C code, can be ~5 lines of linear algebra in Hail – and it will benefit from Hail’s scalability for free.

UKBB / GWAS / WES analysis examples

This is a tough one, since the development team doesn’t really have complete insight into everything people are doing with Hail. We’d love if people shared all their code, but this seems not to happen – a loss for reproducibility. I’ll reach out to some of the people who I know are doing this kind of analysis in the coming weeks, and see if there’s a way we can get their code on GitHub with a link from the Hail docs.

What Hail can’t do

Common methods not in Hail

This is definitely an incomplete list (I may edit the post to add more later), but the most important missing features in Hail right now are:

  • mixed models. We do have a linear mixed model but it doesn’t really scale as well as we’d hoped, so things like BOLT-LMM will beat Hail. We don’t have any attempt at a logistic mixed model, and so don’t have a way to compare with SAIGE/SAIGE-GENE right now.
  • imputation. Will address that further below.
  • haplotype methods. There’s really nothing in Hail right now for haplotype-based analyses, but with good motivating examples, we could certainly add features here.
  • non-trio pedigrees. Hail has great support for dealing with trio data, but nothing built in to deal with larger pedigrees.

Loading a large and growing cohort into Hail

This is a good opportunity to mention that we’re not just taking other methods and reimplementing them in ways that scale – we’re also doing research to build entirely new formats, algorithms, and methods. In the linked post I reference work being done by Chris Vittal on the team to build new ways to represent and compute on large cohorts of sequenced samples. I can’t find any materials from the talk he gave at GA4GH in Basel last year, but here’s a short talk he gave as part of a Stanley Center primer on Hail last month.

There’s no other software I know of that addresses this problem acceptably, so tolerating tools in development is probably the only option! Chris is currently using Hail to transform about 150K WGS gVCFs into the format described in the video.

Imputation

Several members of Ben’s group formed an imputation working group about a year ago to think about this very problem, and concluded that the latest versions of IMPUTE and Beagle are amazing methods and that there’s no reason to try to build a competing algorithm. However, we want to make it possible to run these external tools without having to entirely leave the Hail ecosystem, and Jackie is building a pipeline execution framework in Hail that should make this process easy. It’s still early days for pipeline, though, and it’ll be a while before that’s stable.

I think that Hail can do pre-impute QC, write VCFs for upload to Sanger or U Michigan impute servers, then import imputed data for post processing and final analyses.

Yeah, this should definitely work right now. I’m not sure anybody locally has experience with this specific workflow, but I don’t see any reason why it would be difficult to build inside Hail at the moment.

Risk vs benefit and conclusions

My personal thoughts:

If you’ve got sequencing data, then there’s really no better platform than Hail, even for small data.

For genotype data, I’m not convinced Hail should be part of the pipeline, especially if you want to run mixed models. In general, the more the analysis will deviate from well-defined command-line modules, the more reason to use Hail – it’s a library that makes it quite easy to explore and understand genetic data, to figure out the right next steps for analysis. I do think, however, that some of our infrastructural improvements in the next 6-12 months will help make the experience running on genotype data much nicer.

One final point: we’d estimate that there are a few hundred active Hail users globally right now, and we do listen to that community – a benefit from being a noisy early-ish adopter is that your own group’s interests will be that much better represented in the development pipeline!

Tim - thank you so much for the comprehensive and thoughtful reply. I greatly appreciate the effort it took. This definitely helps, and I agree that it would be great in a FAQ.

Four things:

  • agree that a code library for analyses of real-world datasets would be great. UKBB is useful but atypical as we usually try to do a fussy/best possible analysis of one disease (see Daniel H’s UKBB post on the necessary compromises when confronting 1000s of traits). Would help for learning and reproducibility (as you point out)

  • would this be useful? is the long-term plan for HAIL akin to the dev process of GATK? Was a Broad-led dev team that began out of a need to master a new datatype, and has been consistently supported for the better part of 10 years?

  • I agree that version is ambiguous: the CS meaning may differ from that of a user used to commercial versioning (mac OS or a word processing program). Perhaps more important is a list of tasks and a gauge of how mature HAIL now is for each task. Example. Main header is case-control WGS, sub headings for QC and analysis. Could list several steps within each (evaluation of SNV and subject missingness, freq distribution, ancestry estimation, generation of graphs to understand impact of AF on missingness). Then, each with a readiness or maturity estimation. If not mature, a guess when it might be. Could do the same for SNP arrays. Could use this to sketch out the sharable parts of what you have planned (the PC-Relate stuff was cool).

  • agree that a bibliography of papers that have used HAIL would be super useful

In essence, the stuff I propose above is really just outward facing PR. But non-trivial - it would really help people understand what HAIL can now do and whether it’s right for them. IMO? Your reply really impressed me, so we’re going to test it out.

my thanks! PF Sullivan

Thanks for your feedback – a perspective from outside the immediate group is tremendously valuable. I’ll address two of the shorter points before starting to draft the “what Hail is good for, and what it’s not”.


UKBB is useful but atypical as we usually try to do a fussy/best possible analysis of one disease

I absolutely agree, and I think we’ve had a few people come to the forum/chatroom looking to use that code as a basis for their own single-trait GWAS. The development team would absolutely love if people shared their code more, but it seems like the incentive structure doesn’t push us in that direction. The mega-GWAS (I never know what to call the quick-and-dirty all-the-traits effort) was a well-organized team project where open code was a prescription all along. This contrasts with single-trait GWAS applications where people are writing code without the expectation of releasing it publicly. In my experience, people (including me, occasionally) are resistant to open-sourcing closed code without spending a bit of time to clean it up, and in academia, there’s always something higher-priority than that.

My one-sentence conclusion here is that we’d love more open-source examples of what people have done with Hail, but it’s a cultural problem that we can’t change overnight.

Bibliography of papers that have used Hail

We’ve tried to assemble such a list in the past, searching for URL citations on Google Scholar. We’re still collecting our thoughts internally about software and citation and how they fit together.


What Hail is good for, what it’s not.

This document is designed for geneticists. As Hail is a general-purpose data analysis tool, we’ll need a similar but separate document pitched at people doing other kinds of analysis.

Hail vs PLINK

Target audience

Hail definitely has a steeper learning curve than PLINK. Learning PLINK involves learning about what the many modules do, how to parameterize them, and how to compose them to perform a correct analysis of the data. I think it’s possible for people without much genetics experience or programming experience to use PLINK to do an acceptable genetic analysis – and that’s pretty amazing. If someone tries to use Hail without either genetics or programming experience, things are going to go badly.

Hail doesn’t include many routines with baked-in best practices. One of the best examples is the linear_regression_rows function, which is one of the options for executing a GWAS. Hail doesn’t have a hl.assoc function, but rather the user must specify the dependent variable (phenotype(s)), the independent variable (number of non-reference alleles in the genotype call, in the example), and any covariates, including the intercept. It’s a reasonable argument that this is a hostile user interface, but we view the interface as much safer – the user should be explicit about what he or she intends to do. If the user doesn’t have a good idea what to do with the data, Hail will force him or her to confront those areas of uncertainty instead of filling in the gaps with hard-coded best practices (which won’t always apply). This design decision doubtlessly makes it harder to use Hail, but I don’t think that’s always a bad thing.

We also very much buy the philosophy that programming is the best available medium to do data science (see Hadley Wickham present on this at ACM). Using a programming library rather than a command-line tool makes it easy and natural to compose prepackaged modules (function calls) and custom/exploratory code. Without some experience in Python, this is very difficult.

We’ve seen people just entering the field try to pick up Hail to analyze some genetic data, without much experience in genetics or programming. This has been a difficult experience for those people (requiring learning both techniques, instead of one at a time in isolation), but I think it also presents a great opportunity to develop more basic educational materials about learning statistical genetics with Hail from first principles. If your group picks up and uses Hail for a while, we would love your input here!

Performance + scalability

PLINK is substantially faster than Hail on one core for most tasks. Linear regression on dosages coming from BGEN files is one of the only examples where Hail is comparable/faster, but for nearly every other task (especially LD pruning, as of 6/17/19!), PLINK leaves Hail in the dust…on a single core. PLINK is able to multithread (use all cores in one machine) but can’t scale across machines. This means that for tasks like the UKBB mega-GWAS, it’s just not an option. If PLINK is too slow for a given task, it’s likely that using Hail can perform the same task in a manageable period of time, given a large enough cluster. Note, though, that the performance difference does mean that Hail is going to be more expensive for that same task right now. Since people’s time is usually worth more than computers’ time, this is often an acceptable tradeoff. Additionally, we’re working hard on infrastructure to improve Hail’s performance, which will bring the single-core performance differences down, and both cheapen and improve the Hail experience.

Final notes

We view Hail not as a PLINK competitor, but as an attempt to build infrastructure that can easily be used to build a PLINK-like tool for genetics, single-cell RNA-seq, or another subdomain of biology. I don’t think that our team is actually going to build those tools, but we won’t be satisfied until basically every feature in PLINK can be implemented in Hail in ~10 or fewer lines of Python using Hail, and the Hail implementations are roughly as performant.

We don’t quite have all the right library primitives right now (linear algebra is mostly missing), but it’s certainly possible to reimplement most PLINK pipelines in Hail right now, and anyone who does that is going to end up both a better programmer, and likely a better geneticist.

Case-Control GWAS

QC: Mature

Hail has best-in-class support for doing sequencing data QC, and much of that benefit translates to genotype data. While it might seem at first glance that a lot of the general-purpose aggregation functionality isn’t necessary here, it’s very useful to be able to take some computation that computes a statistic, and move that code inside a group_by aggregator to compute that metric split by cases+controls, sex, ancestry, or all the above.

It’s easy to compute PCs in-sample in Hail, but it’s also easy to import the gnomAD variant loadings to project samples onto gnomAD PCs. Hail doesn’t have any clustering methods yet, but the Python environment solves that for us – it’s just a few lines of code to localize data and put PCs + ancestry in a classification algorithm in sklearn.

Running Hail in a Python notebook makes it possible to leverage some of the interactive (javascript) plotting libraries developed in the last several years to make inline plots that can be panned, zoomed, and hovered over to show, for instance, the sample ID of an outlier.

I’m not splitting this out by topic, but if there are specific ones you’re interested in (besides the ones listed like missingness + frequency distributions, which are all super well supported in Hail), I’m happy to go into more detail.

Analysis

The maturity of analysis components varies wildly. I’ll list several here:

Linear mixed models: Immature

Hail’s LMMs don’t scale especially well. You’re probably better off running BOLT-LMM.

Logistic mixed models: Absent

Hail doesn’t have a logistic mixed model implementation.

Relatedness: Mature

We have IBD, GRM, and PC-Relate implementations in Hail.

Rare variant burden tests (no kernel): Mature

Hail makes it extremely easy to write a variety of burden tests, because burden tests are really just a composition of modular pieces of Hail: group by gene / variant class, aggregate to count alternate alleles (or whatever other aggregation you want) to produce a gene x sample matrix, then run linear/logistic regression to test the results. Alternatively, instead of regression one could aggregate by gene to compute the number of reference/alternate alleles among cases/controls and run a Fisher’s exact test – it’s all modular!

Rare variant burden kernel tests (SKAT, SKAT-O): Immature/Absent

Hail has an implementation of SKAT, but it’s a bit brittle since it has few users. Hail does not have SKAT-O, or any other kernel-based rare variant test.

Trio-based analysis: Mature

Hail has a function called trio_matrix that rearranges data so that all the data for each trio is contained in the same entry (cell) of the data matrix – this makes it very natural to write analyses of trio data.

Larger pedigrees: ?

I’m not sure what the standard things people do with larger pedigrees are. Some things are well-supported, like finding variants that only have mutant alleles in one family. We don’t have any algorithms for doing something like tracing inheritance through a pedigree, though.

Imputation: Absent

As above, we don’t want to try to reinvent the wheel with all the great imputation software out right now.


I’m missing a lot of analyses from the above list, but this is a good place to start!

Feel free to provide more discussion points (methods, etc) and I’ll try to comment on those as well.

Again, thanks for the extensive and helpful replies (as one of my colleagues said, we usually only get some version of “RTFM”). You have fully addressed my query.

About reproducibility. I think that someone needs to write a Nat Gen opinion piece on this. I think this should be enforced, that a complete description of all steps of an analysis should be in supplemental methods for all genomics papers (script, markdown, notebook). Your UKBB work is an example. I ran across another example recently, Alex Urban’s group did a comprehensive analysis of K562 cells and published their full script (https://genome.cshlp.org/content/suppl/2019/02/08/gr.234948.118.DC1/Supplemental_Analysis_code.txt)

my thanks! pfs