We’ve added a version of linear regression called linreg_multi_pheno that takes a list of phenotypes rather than a single phenotype, and is far more efficient than looping over linreg for scenarios where the set of samples is fixed over all regressions. In particular, this method only makes one pass over the data and takes greater advantage of BLAS optimization.
We took the method for a one-billion-regression-eQTL-spin on a Macbook Pro. Namely: 20k phenotypes (gene expression from Geuvadis), 50k common variants on 464 sample from the 1000 genomes project, with 15 covariates (intercept, sex, 3 PCs, 10 Peer factors). The billion multivariate regressions took about 150 seconds, whereas exporting the p-values to a 9GB compressed tsv.gz took 8 minutes more. Instead, pushing the p-values to a PySpark Dataframe and saving to Parquet format only took 1 minute more (although it was still 8GB). Of course, one might prefer to just continue analysis in PySpark and write a far smaller file of interesting results. And the same method may be run variant-parallel on thousands of cores in the cloud, making whole-genome cis and trans eQTL studies doable on the order of minutes.