Linear regression burden tests, collapsing genotypes by variant key

jbloom · April 25, 2017, 10:00pm

We’ve added the method linreg_burden to provide a very general framework for aggregating (collapsing) genotypes and then applying linear regression. This method returns both a key table of linear regression results and a key table of collapsed sample scores for those samples with non-missing phenotype and covariates. The latter can then be further analyzed in PySpark or locally in tools like Python and R. Click the method link above for more details.

We ran a benchmark on the cloud starting from the hardcall VDS of the 1000 Genomes Project, consisting of 2535 samples and 37.2 million variants with 1000 partitions. We generated a random phenotype and two random sample covariates and annotated each variant by the set of genes in which that variant resides, using a quick and dirty interval list of 18338 genes. We then computed a score per gene per sample as a weighted sum of genotypes, using a random variant annotation va.weight.

With 344 cores, running the linear burden test and exporting both the regression results and (sample, gene) aggregated values to TSVs took about two minutes. Here’s a local version of the timed command:

linreg_kt, sample_kt = (hc.read('1kg_burden.vds')
    .linreg_burden(key_name='gene',
                   variant_keys='va.genes',
                   single_key=False,
                   agg_expr='gs.map(g => va.weight * g.gt).sum()',
                   y='sa.pheno',
                   covariates=['sa.cov1', 'sa.cov2']))

linreg_kt.export('linreg_table.tsv')
sample_kt.export('sample_table.tsv')

jbloom · October 4, 2018, 5:34pm

This post is deprecated. Burden test functionality is now fully generic in Hail 0.2, example here

Topic		Replies	Views
Logistic regression burden tests Updates	4	3161	September 12, 2018
Linear regression hanging - help needed Hail Query & hailctl	9	506	August 7, 2023
Linear regression based on genotype-summary statistics Hail Query & hailctl	2	464	August 7, 2020
Fast linear regression for multiple phenotypes, eQTLs Updates	24	3299	July 17, 2018
Simple rare variant burden testing with Fisher exact test Help [0.1]	0	1877	May 12, 2017

Linear regression burden tests, collapsing genotypes by variant key

Related topics