Thanks for your feedback – a perspective from outside the immediate group is tremendously valuable. I’ll address two of the shorter points before starting to draft the “what Hail is good for, and what it’s not”.
UKBB is useful but atypical as we usually try to do a fussy/best possible analysis of one disease
I absolutely agree, and I think we’ve had a few people come to the forum/chatroom looking to use that code as a basis for their own single-trait GWAS. The development team would love it if people shared their code more, but it seems like the incentive structure doesn’t push us in that direction. The mega-GWAS (I never know what to call the quick-and-dirty all-the-traits effort) was a well-organized team project where open code was a requirement all along. This contrasts with single-trait GWAS applications, where people write code without expecting to release it publicly. In my experience, people (including me, occasionally) are resistant to open-sourcing closed code without spending a bit of time to clean it up, and in academia, there’s always something higher-priority than that.
My one-sentence conclusion here is that we’d love more open-source examples of what people have done with Hail, but it’s a cultural problem that we can’t change overnight.
Bibliography of papers that have used Hail
We’ve tried to assemble such a list in the past, searching for URL citations on Google Scholar. We’re still collecting our thoughts internally about software and citation and how they fit together.
What Hail is good for, what it’s not.
This document is designed for geneticists. As Hail is a general-purpose data analysis tool, we’ll need a similar but separate document pitched at people doing other kinds of analysis.
Hail vs PLINK
Hail definitely has a steeper learning curve than PLINK. Learning PLINK involves learning about what the many modules do, how to parameterize them, and how to compose them to perform a correct analysis of the data. I think it’s possible for people without much genetics experience or programming experience to use PLINK to do an acceptable genetic analysis – and that’s pretty amazing. If someone tries to use Hail without either genetics or programming experience, things are going to go badly.
Hail doesn’t include many routines with baked-in best practices. One of the best examples is the linear_regression_rows function, which is one of the options for executing a GWAS. Hail doesn’t have a hl.assoc function, but rather the user must specify the dependent variable (phenotype(s)), the independent variable (number of non-reference alleles in the genotype call, in the example), and any covariates, including the intercept. It’s a reasonable argument that this is a hostile user interface, but we view the interface as much safer – the user should be explicit about what he or she intends to do. If the user doesn’t have a good idea what to do with the data, Hail will force him or her to confront those areas of uncertainty instead of filling in the gaps with hard-coded best practices (which won’t always apply). This design decision doubtlessly makes it harder to use Hail, but I don’t think that’s always a bad thing.
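To make the explicit interface concrete, here is a minimal sketch of a single-trait GWAS with linear_regression_rows. The phenotype and covariate fields (mt.pheno.height, mt.pheno.age, mt.pheno.is_female) are hypothetical column annotations chosen for illustration; substitute whatever your dataset actually carries. The 1.0 in the covariate list is the intercept, which Hail will not add for you.

```python
def run_gwas(mt):
    """Linear-regression GWAS on a MatrixTable that has a GT entry field.

    Assumes hypothetical column annotations mt.pheno.height, mt.pheno.age
    and mt.pheno.is_female; these names are not provided by Hail.
    """
    # Imported inside the function so the sketch loads without starting Hail.
    import hail as hl

    return hl.linear_regression_rows(
        y=mt.pheno.height,          # dependent variable (phenotype)
        x=mt.GT.n_alt_alleles(),    # independent variable: non-reference allele count
        covariates=[1.0, mt.pheno.age, mt.pheno.is_female],  # 1.0 is the intercept
    )
```

The result is a Hail Table keyed by variant, with fields such as beta, standard_error and p_value.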
We also very much buy the philosophy that programming is the best available medium to do data science (see Hadley Wickham present on this at ACM). Using a programming library rather than a command-line tool makes it easy and natural to compose prepackaged modules (function calls) and custom/exploratory code. Without some experience in Python, this is very difficult.
We’ve seen people just entering the field try to pick up Hail to analyze some genetic data, without much experience in genetics or programming. This has been a difficult experience for those people (requiring learning both techniques, instead of one at a time in isolation), but I think it also presents a great opportunity to develop more basic educational materials about learning statistical genetics with Hail from first principles. If your group picks up and uses Hail for a while, we would love your input here!
Performance + scalability
PLINK is substantially faster than Hail on one core for most tasks. Linear regression on dosages coming from BGEN files is one of the only examples where Hail is comparable/faster, but for nearly every other task (especially LD pruning, as of 6/17/19!), PLINK leaves Hail in the dust…on a single core. PLINK is able to multithread (use all cores in one machine) but can’t scale across machines. This means that for tasks like the UKBB mega-GWAS, PLINK is just not an option. If PLINK is too slow for a given task, it’s likely that Hail can perform the same task in a manageable period of time, given a large enough cluster. Note, though, that the performance difference does mean that Hail is going to be more expensive for that same task right now. Since people’s time is usually worth more than computers’ time, this is often an acceptable tradeoff. Additionally, we’re working hard on infrastructure to improve Hail’s performance, which will narrow the single-core performance gap and make the Hail experience both cheaper and better.
We view Hail not as a PLINK competitor, but as an attempt to build infrastructure that can easily be used to build a PLINK-like tool for genetics, single-cell RNA-seq, or another subdomain of biology. I don’t think that our team is actually going to build those tools, but we won’t be satisfied until basically every feature in PLINK can be implemented in ~10 or fewer lines of Python using Hail, and the Hail implementations are roughly as performant.
We don’t quite have all the right library primitives right now (linear algebra is mostly missing), but it’s certainly possible to reimplement most PLINK pipelines in Hail right now, and anyone who does that is going to end up both a better programmer, and likely a better geneticist.
Hail has best-in-class support for doing sequencing data QC, and much of that benefit translates to genotype data. While it might seem at first glance that a lot of the general-purpose aggregation functionality isn’t necessary here, it’s very useful to be able to take some code that computes a statistic and move it inside a group_by aggregator to compute that metric split by cases+controls, sex, ancestry, or all the above.
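As a rough sketch of that pattern, the snippet below computes a per-variant call rate once for the whole cohort and once split by case status, reusing the same aggregation expression. The column field mt.is_case is a hypothetical annotation assumed for illustration.

```python
def call_rate_by_status(mt):
    """Annotate each variant with its call rate, overall and by case status.

    Assumes a hypothetical boolean column field mt.is_case.
    """
    # Imported inside the function so the sketch loads without starting Hail.
    import hail as hl

    # The statistic itself: fraction of samples with a non-missing genotype call.
    call_rate = hl.agg.fraction(hl.is_defined(mt.GT))

    return mt.annotate_rows(
        call_rate=call_rate,  # whole cohort
        # Same statistic, moved inside group_by: a dict keyed by case status.
        call_rate_by_case=hl.agg.group_by(mt.is_case, call_rate),
    )
```

To split by several variables at once, the grouping key can be a struct, for example hl.struct(is_case=mt.is_case, is_female=mt.is_female), again assuming those fields exist.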
It’s easy to compute PCs in-sample in Hail, but it’s also easy to import the gnomAD variant loadings to project samples onto gnomAD PCs. Hail doesn’t have any clustering methods yet, but the Python environment solves that for us – it’s just a few lines of code to localize data and put PCs + ancestry in a classification algorithm in
I’m not splitting this out by topic, but if there are specific ones you’re interested in (besides the ones listed like missingness + frequency distributions, which are all super well supported in Hail), I’m happy to go into more detail.
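As one concrete instance of the in-sample PCA step mentioned above, here is a minimal sketch that computes PCs and localizes the scores into a pandas DataFrame, where any Python classification or clustering code can take over. The default of 10 components is an arbitrary choice for illustration.

```python
def in_sample_pcs(mt, k=10):
    """Compute k in-sample principal components and return them as a DataFrame."""
    # Imported inside the function so the sketch loads without starting Hail.
    import hail as hl

    # hwe_normalized_pca returns (eigenvalues, scores table, loadings table);
    # loadings are only computed when compute_loadings=True.
    _eigenvalues, scores, _loadings = hl.hwe_normalized_pca(mt.GT, k=k)

    # scores is keyed by sample, with an array field `scores` of length k.
    return scores.to_pandas()
```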
The maturity of analysis components varies wildly. I’ll list several here:
Linear mixed models: Immature
Hail’s LMMs don’t scale especially well. You’re probably better off running BOLT-LMM.
Logistic mixed models: Absent
Hail doesn’t have a logistic mixed model implementation.
We have IBD, GRM, and PC-relate implementations in Hail.
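For orientation, a minimal sketch of calling the three methods on the same dataset. The parameter values (a 1% minor allele frequency floor and 10 PCs for PC-relate) are placeholders I picked for illustration, not recommendations.

```python
def relatedness_estimates(mt):
    """Run IBD, GRM and PC-relate on a MatrixTable with a GT entry field."""
    # Imported inside the function so the sketch loads without starting Hail.
    import hail as hl

    ibd = hl.identity_by_descent(mt)             # Table of pairwise IBD estimates
    grm = hl.genetic_relatedness_matrix(mt.GT)   # BlockMatrix, samples x samples
    kin = hl.pc_relate(                          # Table of pairwise kinship
        mt.GT,
        min_individual_maf=0.01,  # placeholder threshold
        k=10,                     # placeholder number of PCs
        statistics="kin",         # kinship only; other statistics are skipped
    )
    return ibd, grm, kin
```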
Rare variant burden tests (no kernel): Mature
Hail makes it extremely easy to write a variety of burden tests. Burden tests are really just a composition of modular pieces of Hail: group by gene / variant class, aggregate to count alternate alleles (or whatever other aggregation you want) to produce a gene x sample matrix, then run linear/logistic regression to test the results. Alternatively, instead of regression one could aggregate by gene to compute the number of reference/alternate alleles among cases/controls and run a Fisher’s exact test – it’s all modular!
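The composition described above can be sketched in a few lines. This assumes a hypothetical row field mt.gene (one gene per variant) and a hypothetical boolean column field mt.is_case; a real analysis would also filter to the variant class of interest first and add proper covariates.

```python
def gene_burden_test(mt):
    """Gene-level burden test: group, aggregate, regress.

    Assumes hypothetical fields mt.gene (row) and mt.is_case (column).
    """
    # Imported inside the function so the sketch loads without starting Hail.
    import hail as hl

    # Group variants by gene and count alternate alleles per sample,
    # producing a gene x sample matrix with a single entry field, n_alt.
    burden = (
        mt.group_rows_by(mt.gene)
        .aggregate(n_alt=hl.agg.sum(mt.GT.n_alt_alleles()))
    )

    # Regress case status on the per-gene burden (intercept-only covariates).
    return hl.logistic_regression_rows(
        test="wald",
        y=burden.is_case,
        x=burden.n_alt,
        covariates=[1.0],
    )
```

Swapping the aggregation (for example, carrier yes/no instead of an allele count) or the final test changes the burden test without touching the rest of the pipeline.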
Rare variant burden kernel tests (SKAT, SKAT-O): Immature/Absent
Hail has an implementation of SKAT, but it’s a bit brittle since it has few users. Hail does not have SKAT-O, or any other kernel-based rare variant test.
Trio-based analysis: Mature
Hail has a function called trio_matrix that rearranges data so that all the data for each trio is contained in the same entry (cell) of the data matrix – this makes it very natural to write analyses of trio data.
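A minimal sketch of what that looks like in practice: build the trio matrix from a pedigree file, then keep entries where the proband is heterozygous and both parents are homozygous reference. The ped_path argument is a hypothetical path to a .fam-style pedigree file, and this crude genotype filter is only an illustration, not a de novo caller.

```python
def proband_only_hets(mt, ped_path):
    """Keep trio entries where only the proband carries an alternate allele.

    ped_path is a hypothetical path to a pedigree (.fam) file.
    """
    # Imported inside the function so the sketch loads without starting Hail.
    import hail as hl

    pedigree = hl.Pedigree.read(ped_path)
    tm = hl.trio_matrix(mt, pedigree, complete_trios=True)

    # Each entry of tm holds proband_entry, father_entry and mother_entry,
    # so one expression can look at all three genotypes at once.
    return tm.filter_entries(
        tm.proband_entry.GT.is_het()
        & tm.father_entry.GT.is_hom_ref()
        & tm.mother_entry.GT.is_hom_ref()
    )
```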
Larger pedigrees: ?
I’m not sure what the standard things people do with larger pedigrees are. Some things are well-supported, like finding variants that only have mutant alleles in one family. We don’t have any algorithms for doing something like tracing inheritance through a pedigree, though.
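For the "only one family" case, a minimal sketch under the assumption that each sample has a hypothetical column field mt.fam_id: collect the set of families with at least one non-reference call at each variant, then keep variants where that set has exactly one member.

```python
def private_to_one_family(mt):
    """Keep variants whose non-reference calls all fall in a single family.

    Assumes a hypothetical column field mt.fam_id identifying each sample's family.
    """
    # Imported inside the function so the sketch loads without starting Hail.
    import hail as hl

    # Per variant: the set of family IDs among samples carrying an alternate allele.
    mt = mt.annotate_rows(
        carrier_families=hl.agg.filter(
            mt.GT.is_non_ref(),
            hl.agg.collect_as_set(mt.fam_id),
        )
    )

    return mt.filter_rows(hl.len(mt.carrier_families) == 1)
```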
As above, we don’t want to try to reinvent the wheel with all the great imputation software out right now.
I’m missing a lot of analyses from the above list, but this is a good place to start!
Feel free to provide more discussion points (methods, etc) and I’ll try to comment on those as well.