Fast linear regression for multiple phenotypes, eQTLs


We’ve added a version of linear regression called linreg_multi_pheno that takes a list of phenotypes rather than a single phenotype, and is far more efficient than looping over linreg for scenarios where the set of samples is fixed over all regressions. In particular, this method only makes one pass over the data and takes greater advantage of BLAS optimization.
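The win comes from sharing work across regressions: with a fixed sample set, every phenotype is regressed against the same design matrix, so one factorization and a few large matrix multiplies fit all of them at once. A minimal numpy sketch of the idea (illustrative only, not Hail's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, P = 200, 3, 50          # samples, predictors (covariates + genotype), phenotypes
X = rng.normal(size=(n, k))   # shared design matrix: same samples for every regression
Y = rng.normal(size=(n, P))   # many phenotypes, one column each

# One pass: a single solve plus a few matmuls fits all P regressions at once.
XtX_inv = np.linalg.inv(X.T @ X)
B = XtX_inv @ X.T @ Y                               # (k, P) coefficients, all phenotypes
resid = Y - X @ B
sigma2 = (resid ** 2).sum(axis=0) / (n - k)         # per-phenotype residual variance
se = np.sqrt(np.outer(np.diag(XtX_inv), sigma2))    # (k, P) standard errors
t = B / se                                          # t-statistics for every phenotype

# The equivalent (much slower) per-phenotype loop gives identical coefficients:
B_loop = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(P)]
)
```

The per-variant work then reduces to BLAS level-3 operations over the whole phenotype matrix, which is where the single-pass speedup comes from.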

We took the method for a one-billion-regression eQTL spin on a MacBook Pro. Namely: 20k phenotypes (gene expression from Geuvadis), 50k common variants on 464 samples from the 1000 Genomes Project, with 15 covariates (intercept, sex, 3 PCs, 10 PEER factors). The billion multivariate regressions took about 150 seconds, whereas exporting the p-values to a 9GB compressed tsv.gz took 8 minutes more. Instead, pushing the p-values to a PySpark DataFrame and saving to Parquet format only took 1 minute more (although the result was still 8GB). Of course, one might prefer to just continue analysis in PySpark and write a far smaller file of interesting results. And the same method may be run variant-parallel on thousands of cores in the cloud, making whole-genome cis and trans eQTL studies doable on the order of minutes.


Hey Jon,

This sounds really cool! If one wanted to use more variants, but focus on a cis-region around each gene, would that play nice with the existing Hail framework? I’ve run some initial experiments where I naively use filter_intervals, but that seems to slow things down considerably relative to the performance you’re seeing here.



@DylanVrana did something similar (eQTLs in 1MB window around each gene) – I’m not sure how exactly his stuff went, but he would be able to say more.


Hi @davidkelley,

I have been using filter_intervals for my cis-eQTL-style analysis (across GTEx tissues, so ~36 bn regressions). I’m not sure of the performance costs of filter_intervals (@tpoterba probably knows better than I do – any thoughts about it?). My filter_intervals-based cis-eQTL analyses take between 30 minutes and 2 hours per tissue (~1.2 bn regressions, 100-200 samples) on a 6-worker Dataproc cluster, depending on the number of samples and how often the workers get preempted away.


Can you share your code to do that? I kinda remember you doing some union stuff to parallelize among intervals.


We really like the linreg_multi_pheno function. Is there an equivalent logreg function for multiple phenotypes with categorical values in the works, or is one already done?



Sorry, missed this last week. We should definitely have this, and I think it’s something we could expect to be in 0.2.

Part of the reason for linreg_multi_pheno is the speedup from vectorizing the matrix multiplies, though, and logistic regression won’t benefit from that boost.
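For intuition on why logistic regression misses out: its normal equations are reweighted on every iteration, and the weights depend on the phenotype, so there is no single shared factorization to amortize across phenotypes. A toy IRLS (iteratively reweighted least squares) sketch in numpy for one binary phenotype (illustrative only; not Hail's implementation):

```python
import numpy as np

def logreg_irls(X, y, n_iter=25):
    """Fit logistic regression by iteratively reweighted least squares (Newton)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # current predicted probabilities
        W = p * (1 - p)                       # weights depend on the current fit...
        # ...so the (X' W X) system must be re-solved every iteration, and W
        # differs per phenotype -- no shared one-shot factorization as in OLS.
        beta = beta + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return beta

rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one predictor
true_beta = np.array([-0.5, 1.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat = logreg_irls(X, y)   # should recover true_beta approximately
```

A multi-phenotype logreg would still be convenient for the single pass over the data, but each phenotype pays its own iteration cost.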


Thanks. When is 0.2 expected to be out?


Can’t really say – there’s still a lot to do. My rough guess would be 3-6 months, but that was also my guess 3 months ago.


Hi @DylanVrana,

It would be really great if you could share how you feed a (large) list of intervals to filter_intervals.




Dylan’s use case was a bit special – he was doing cis-eQTL analysis, so he wanted to run regressions for variants near each gene. The natural way to do this was looping through genes, and:

  1. Annotating the expression for that gene
  2. Filtering the dataset to intervals near that gene
  3. Running linear regression and exporting results.

Dylan and I worked together to formulate a better strategy for this problem.

If you just want to filter a dataset to a large set of intervals (say, filtering WGS data to exome capture regions), then you should use filter_variants_table. You can import an interval list into a key table with KeyTable.import_interval_list or KeyTable.import_bed.
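As a rough picture of why one bulk interval filter beats a per-gene loop: with sorted intervals, membership of each variant can be decided with a binary search in a single pass over the data, instead of re-scanning the dataset once per gene. A plain-Python sketch of that idea (illustrative; not how Hail implements filter_variants_table):

```python
import bisect

# Sorted, non-overlapping intervals on one contig: (start, end), end exclusive.
intervals = [(100, 200), (500, 800), (1200, 1300)]
starts = [s for s, _ in intervals]

def in_intervals(pos):
    """True if pos falls inside any interval -- O(log n) per variant."""
    i = bisect.bisect_right(starts, pos) - 1   # candidate interval at or before pos
    return i >= 0 and pos < intervals[i][1]

variant_positions = [50, 150, 499, 500, 799, 800, 1250]
kept = [p for p in variant_positions if in_intervals(p)]
print(kept)   # -> [150, 500, 799, 1250]
```

The bulk filter does this check once per variant against the whole interval set, so the cost is one scan of the data regardless of how many intervals (or genes) there are.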


These functions will indeed be very useful when filtering exomes, thanks.

What I’m also after is similar to Dylan’s situation: I’m trying to find out if I can analyse methylation data in Hail using univariate or multivariate linear mixed models (similar to what GEMMA does). We’re doing cis and trans analyses, and the way GEMMA works, we have to create hundreds of thousands of files, one for each set of variants for which we want to perform the analysis. The way Hail works would alleviate this burden, and allow us to perform the analyses by filtering based on intervals. Do you think that’s possible? Is there a way to annotate g with methylation status?


The genotype field in 0.1 is generally expected to be the GATK schema: GT, AD, DP, GQ, PL. We do have limited support for totally generic genotypes, but you can’t do much with them and everything runs much more slowly on them. Many methods are hard-coded against this schema, like linear regression (it only regresses against GT or dosage).

We expect to have a completely generic schema in 0.2 with no performance cost, which will let you annotate the methylation status and use that in linear regression, for instance.


Hi there - I am also working on a cis-eQTL analysis at the moment, but I keep getting a timeout error when I try to loop through all genes on a dataproc cluster. Did you and Dylan run into a similar issue, and if so, how did you solve it? Would appreciate any insight you have about running and optimizing an analysis like this.



Can you provide the pipeline and the error messages?


This is Hail 0.2 right?


Hi - no this is 0.1. I would switch, but we’re in the middle of trying to meet a tight deadline and I’m reluctant to try to learn the new version…though I could be convinced!

Here’s the pipeline, as run on dataproc using Jupyter notebook (apologies in advance for length). The final block with the loop is where it falls apart:

# Load Hail
import hail
from hail import *
hc = HailContext()

# Load packages
import pandas as pd
import numpy as np
from math import log, isnan
import os,sys,glob,gzip
from pprint import pprint
# Load toy VDS
vds = hc.read('gs://donnawerlingucsf/devEQTL/filterPart2_toy-180402.vds')
# Copy expression data table from GS to cluster
!gsutil cp gs://donnawerlingucsf/devEQTL/misc_files/counts.gene.cleaned-to-Donna-02092018-transpose.tsv counts.gene.cleaned-to-Donna-02092018-transpose.tsv
# Import sample annotation table with full expression data as pandas data frame
exprdf = pd.read_csv('counts.gene.cleaned-to-Donna-02092018-transpose.tsv', sep='\t', index_col=0)
# Samples from VDS
sampsInVds = vds.sample_ids
# Subset the expression data frame
exprdf = exprdf.loc[sampsInVds,:]
# Calculate per-gene expression sums across this subsetted sample set
exprsums = exprdf.sum(axis=0)
# Find non-expressed genes
noExpr = exprsums[exprsums.le(0)]
# Remove these genes from the expression data
exprdf = exprdf.drop(noExpr.index, axis=1)
# Write this df out to file
exprdf.to_csv(path_or_buf='expressionFilt_176s_40707g-180411.txt', sep='\t')
# Transfer file to GS
!gsutil cp expressionFilt_176s_40707g-180411.txt gs://donnawerlingucsf/devEQTL/misc_files/
# Re-load this expression data object as a keyTable
exprdat = hc.import_table('gs://donnawerlingucsf/devEQTL/misc_files/expressionFilt_176s_40707g-180411.txt', impute=True).key_by('Sample')
# Add expression to sample annotations for this VDS
vds = vds.annotate_samples_table(exprdat, root='sa.exprdat')
# Import sample annotation table with sample metadata information
metadat = hc.import_table('gs://donnawerlingucsf/devEQTL/misc_files/eQTL_metadata_forDex-format-180316.txt', impute=True).key_by('Braincode')
# Add metadata to sample annotations for this VDS
vds = vds.annotate_samples_table(metadat, root='')
# Add integer-coded variable for Sex
vds = vds.annotate_samples_expr(' = if ( == "f") 1 else if ( == "m") 0 else NA: Int')
# Import gene locations file
geneloc = hc.import_table('gs://donnawerlingucsf/devEQTL/misc_files/geneLocations.geneIdForHail.gencode.v21.txt', impute=True).key_by('geneId')
# Convert geneloc keyTable to pandas dataframe
genedf = geneloc.to_pandas()
# Run the full loop, on all 40707 genes
expr_genes = list(exprdf.columns)  # gene list: the columns of the filtered expression table
res_table = pd.DataFrame()
for i in range(0, len(expr_genes)-1):
    gene = '`' + expr_genes[i] + '`'
    testPheno = 'sa.exprdat.' + gene
    chrVar = list(genedf.loc[genedf['geneId'] == gene]['chr'])[0]
    startVar = list(genedf.loc[genedf['geneId'] == gene]['start'])[0]
    endVar = list(genedf.loc[genedf['geneId'] == gene]['end'])[0]
    # Run linear regression for this gene
    vds_res = vds.filter_intervals(Interval(Locus(chrVar, startVar-1000000), Locus(chrVar, endVar+1000000)))\
    .linreg(testPheno, covariates=['',''])
    # Pull out the results for this cis-eQTL analysis, filter to p<0.05, rename columns, convert to pandas data frame
    this_table = vds_res.variants_table().flatten()\
    .select(['v','va.rsid','va.linreg.beta','','va.linreg.tstat','va.linreg.pval'])\
    .filter("`va.linreg.pval` > 0.05", keep=False)\
    .rename(['Variant','rsID','Beta','SE','Tstat','P'])\
    .annotate('Variant = str(Variant)')\
    .to_pandas(expand=False)
    # Check - are there any significant results for this gene?
    nSig = this_table.size
    if nSig >= 1:
        # Add gene name to this table
        this_table['GeneId'] = gene.replace('`','')
        this_table[['EnsemblID','GeneSymbol']] = this_table['GeneId'].str.split('|',expand=True)

        # Split up the information in the variant column, re-order columns
        this_table[['Chr','Start','Ref','Alt']] = this_table['Variant'].str.split(':',expand=True)
        this_table = this_table[['GeneId','EnsemblID','GeneSymbol','Variant','Chr','Start','Ref','Alt','rsID','Beta','SE','Tstat','P']].head()

        # Add results to existing master results table
        res_table = pd.concat([res_table,this_table])


And here is the error message I see, after the loop has processed about 9k of the 40k genes in the analysis:

timeoutTraceback (most recent call last)
<ipython-input-36-f1e28f299135> in <module>()
     29     # Pull out the results for this cis-eQTL analysis, filter to p<0.05, rename columns, convert to pandas data frame
---> 30     this_table = vds_res.variants_table().flatten()    .select(['v','va.rsid','va.linreg.beta','','va.linreg.tstat','va.linreg.pval'])    .filter("`va.linreg.pval` > 0.05", keep=False)    .rename(['Variant','rsID','Beta','SE','Tstat','P'])    .annotate('Variant = str(Variant)')     .to_pandas(expand=False)
     32     # Check - are there any significant results for this gene?

<decorator-gen-253> in to_pandas(self, expand, flatten)

/home/hail/ in handle_py4j(func, *args, **kwargs)
    113 def handle_py4j(func, *args, **kwargs):
    114     try:
--> 115         r = func(*args, **kwargs)
    116     except py4j.protocol.Py4JJavaError as e:
    117         tpl = Env.jutils().handleForPython(e.java_exception)

<decorator-gen-252> in to_pandas(self, expand, flatten)

/home/hail/ in _typecheck(f, *args, **kwargs)
    243     def _typecheck(f, *args, **kwargs):
    244         check_all(f, args, kwargs, checkers, is_method=True)
--> 245         return f(*args, **kwargs)
    247     return decorator(_typecheck)

/home/hail/ in to_pandas(self, expand, flatten)
    642         """
--> 644         return self.to_dataframe(expand, flatten).toPandas()
    646     @handle_py4j

/usr/lib/spark/python/pyspark/sql/dataframe.pyc in toPandas(self)
   1440         """
   1441         import pandas as pd
-> 1442         return pd.DataFrame.from_records(self.collect(), columns=self.columns)
   1444     ##########################################################################################

/usr/lib/spark/python/pyspark/sql/dataframe.pyc in collect(self)
    309         with SCCallSiteSync(self._sc) as css:
    310             port = self._jdf.collectToPython()
--> 311         return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
    313     @ignore_unicode_prefix

/usr/lib/spark/python/pyspark/rdd.pyc in _load_from_socket(port, serializer)
    140     try:
    141         rf = sock.makefile("rb", 65536)
--> 142         for item in serializer.load_stream(rf):
    143             yield item
    144     finally:

/usr/lib/spark/python/pyspark/serializers.pyc in load_stream(self, stream)
    137         while True:
    138             try:
--> 139                 yield self._read_with_length(stream)
    140             except EOFError:
    141                 return

/usr/lib/spark/python/pyspark/serializers.pyc in _read_with_length(self, stream)
    155     def _read_with_length(self, stream):
--> 156         length = read_int(stream)
    157         if length == SpecialLengths.END_OF_DATA_SECTION:
    158             raise EOFError

/usr/lib/spark/python/pyspark/serializers.pyc in read_int(stream)
    542 def read_int(stream):
--> 543     length =
    544     if not length:
    545         raise EOFError

/usr/lib/python2.7/socket.pyc in read(self, size)
    378                 # fragmentation issues on many platforms.
    379                 try:
--> 380                     data = self._sock.recv(left)
    381                 except error, e:
    382                     if e.args[0] == EINTR:

timeout: timed out

I would also note that this loop actually runs to completion on my laptop, in about 3 hours (this is on a variant-downsampled VDS). The timeout issue only happens when I try to run the same analysis on a dataproc cluster.


Thanks for the detail! That’ll make this a lot easier to figure out.

This looks similar:

I think it might also make things a bit easier if you export the linreg results to a file instead of bringing back to pandas, and then concatenate everything later. That way, if things fail again, you can restart from the point where it failed!
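A sketch of that export-then-concatenate pattern in plain Python/pandas (the directory and helper names are made up for illustration): write one results file per gene, skip genes whose file already exists, and concatenate at the end, so a failed run can resume where it stopped.

```python
import os
import tempfile
import pandas as pd

out_dir = tempfile.mkdtemp()         # per-gene result files live here (illustrative)
genes = ["GENE1", "GENE2", "GENE3"]  # stand-in for the real gene list

def run_gene(gene):
    # Stand-in for the per-gene filter_intervals + linreg + to_pandas step.
    return pd.DataFrame({"GeneId": [gene], "P": [0.01]})

for gene in genes:
    path = os.path.join(out_dir, gene + ".tsv")
    if os.path.exists(path):         # finished in a previous run: skip it
        continue
    run_gene(gene).to_csv(path, sep="\t", index=False)

# Once every gene has a file, concatenate them into one master table.
res_table = pd.concat(
    [pd.read_csv(os.path.join(out_dir, g + ".tsv"), sep="\t") for g in genes],
    ignore_index=True,
)
```

This also avoids growing one big DataFrame in driver memory across 40k iterations, which is where long loops like this tend to get fragile.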

I’ll try to find time for a reply commensurate with your detailed report, but it might not be until Friday.