Fast linear regression for multiple phenotypes, eQTLs


We’ve added a version of linear regression called linreg_multi_pheno that takes a list of phenotypes rather than a single phenotype, and is far more efficient than looping over linreg for scenarios where the set of samples is fixed over all regressions. In particular, this method only makes one pass over the data and takes greater advantage of BLAS optimization.

We took the method for a one-billion-regression-eQTL-spin on a Macbook Pro. Namely: 20k phenotypes (gene expression from Geuvadis), 50k common variants on 464 sample from the 1000 genomes project, with 15 covariates (intercept, sex, 3 PCs, 10 Peer factors). The billion multivariate regressions took about 150 seconds, whereas exporting the p-values to a 9GB compressed tsv.gz took 8 minutes more. Instead, pushing the p-values to a PySpark Dataframe and saving to Parquet format only took 1 minute more (although it was still 8GB). Of course, one might prefer to just continue analysis in PySpark and write a far smaller file of interesting results. And the same method may be run variant-parallel on thousands of cores in the cloud, making whole-genome cis and trans eQTL studies doable on the order of minutes.


Hey Jon,

This sounds really cool! If one wanted to use more variants, but focus on a cis-region around each gene, would that play nice with the existing Hail framework? I’ve run some initial experiments where I naively use filter_intervals, but that seems to slow things down considerably relative to the performance you’re seeing here.



@DylanVrana did something similar (eQTLs in 1MB window around each gene) – I’m not sure how exactly his stuff went, but he would be able to say more.


Hi @davidkelley,

I have been using filter_intervals for my cis-eQTL-style analysis (across GTEx tissues, so ~36 bn regressions). I’m not sure of the performance costs of filter_intervals (@tpoterba probably knows better than I do- any thoughts about it?). My filter_intervals based cis-eQTL analyses take between 30 minutes and 2 hours per tissue (~1.2 bn regressions, 100-200 samples) on a 6-worker Dataproc cluster, depending on the number of samples and how often the workers get preempted away.


Can you share your code to do that? I kinda remember you doing some union stuff to parallelize among intervals.


We really like the linreg_multi_pheno function. Is there an equivalent logreg function for multiple phenotypes with categorical values in the cooking or already done?



Sorry, missed this last week. We should definitely have this, and I think it’s something we could expect to be in 0.2.

Part of the reason for linreg_multi_pheno is the speedup from vectorizing the matrix multiplies, though, and logistic regression won’t benefit from that boost.


Thanks. When is 0.2 expected to be out?


Can’t really say – there’s still a lot to do. My rough guess would be 3-6 months, but that was also my guess 3 months ago.


Hi @DylanVrana,

it would be really great if you could share how you feed a (large) list of intervals to filter_intervals.




Dylan’s use case was a bit special – he was doing cis-eQTL analysis, so he wanted to run regressions for variants near each gene. The natural way to do this was looping through genes, and:

  1. Annotation the expression for that gene
  2. Filtering the dataset to intervals near that gene
  3. Running linear regression and exporting results.

Dylan and I worked together to formulate a better strategy for this problem.

If you just want to filter to a dataset to a large set of intervals (say, filtering WGS data to exome capture regions), then you should use filter_variants_table. You can import an interval list into a key table with KeyTable.import_interval_list or KeyTable.import_bed.


These functions will indeed be very useful when filtering exomes, thanks.

What I’m also after is similar to Dylan’s situation, I’m trying to find out if I can analyse methylation data in Hail using univariate or multivariate linear mixed models (similar to what GEMMA does We’re doing cis and trans analyses, and the way GEMMA works, we have to create hundreds of thousands of files, one for each set of variants for which we want to perform the analysis. The way Hail works would alleviate this burden, and allow us to perform the analyses by filtering based on intervals. Do you think that’s possible? Is there a way to annotate g with methylation status?


The genotype field in 0.1 is generally expected to be the GATK schema: GT, AD, DP, GQ, PL. We do have limited support for totally generic genotypes, but you can’t do much with them and everything runs much more slowly on them. Many methods are hard-coded against this schema, like linear regression (it only regresses against GT or dosage).

We expect to have a completely generic schema in 0.2 with no performance cost, which will let you annotate the methylation status and use that in linear regression, for instance.