We’ve added the method linreg_burden to provide a general framework for aggregating (collapsing) genotypes and then applying linear regression. The method returns both a key table of linear regression results and a key table of collapsed scores for each sample with non-missing phenotype and covariates. The latter can then be analyzed further in PySpark or locally with tools like Python and R (see the short sketch at the end of this post). Click the method link above for more details.
We ran a benchmark on the cloud starting from the hardcall VDS of the 1000 Genomes Project, consisting of 2535 samples and 37.2 million variants in 1000 partitions. We generated a random phenotype and two random sample covariates, and annotated each variant with the set of genes in which it resides, using a quick-and-dirty interval list of 18338 genes. We then computed a score per gene per sample as a weighted sum of genotypes, using a random variant annotation va.weight.
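To make the collapsing step concrete: for each gene and each sample, the score is the sum over that gene's variants of va.weight times the genotype call, and the phenotype is then regressed on that score together with an intercept and the covariates. Below is a minimal pure-Python/NumPy sketch of this computation on a toy dataset; the names (G, weight, pheno, cov) are ours, each variant is assigned to a single gene for simplicity (whereas va.genes in the benchmark is a set), and ordinary least squares here only illustrates the idea rather than Hail's implementation.

import numpy as np

# Toy data: rows of G are variants, columns are samples (hard-call genotypes 0/1/2).
G = np.array([[0, 1, 2, 0],
              [1, 0, 1, 2],
              [0, 0, 1, 1],
              [2, 1, 0, 0],
              [1, 1, 1, 0]])
weight = np.array([0.5, 1.0, 0.25, 2.0, 1.5])        # analogue of va.weight
gene_of_variant = ['A', 'A', 'B', 'B', 'B']          # analogue of va.genes (one gene per variant here)
pheno = np.array([1.2, 0.4, 2.3, 0.9])               # analogue of sa.pheno
cov = np.array([[0.1, -0.3],                         # analogue of sa.cov1, sa.cov2
                [1.1, 0.0],
                [0.4, 0.7],
                [-0.2, 0.5]])

# Collapse: per-gene, per-sample score = sum over the gene's variants of weight * genotype.
genes = sorted(set(gene_of_variant))
score = {g: sum(weight[i] * G[i] for i, gv in enumerate(gene_of_variant) if gv == g)
         for g in genes}

# Regress the phenotype on each gene score plus an intercept and the covariates.
for g in genes:
    X = np.column_stack([score[g], np.ones(len(pheno)), cov])
    beta, _, _, _ = np.linalg.lstsq(X, pheno, rcond=None)
    print(g, 'beta =', beta[0])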
With 344 cores, running the linear burden test and exporting both the regression results and (sample, gene) aggregated values to TSVs took about two minutes. Here’s a local version of the timed command:
from hail import *

# Initialize the Hail context (Hail 0.1).
hc = HailContext()

# Collapse genotypes per (gene, sample) with a weighted sum, then regress the phenotype on each gene score with two covariates.
linreg_kt, sample_kt = (hc.read('1kg_burden.vds')
    .linreg_burden(key_name='gene',
                   variant_keys='va.genes',
                   single_key=False,
                   agg_expr='gs.map(g => va.weight * g.gt).sum()',
                   y='sa.pheno',
                   covariates=['sa.cov1', 'sa.cov2']))

# Export regression results and the per-(sample, gene) scores to TSV.
linreg_kt.export('linreg_table.tsv')
sample_kt.export('sample_table.tsv')
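As noted above, the sample key table of collapsed scores can be analyzed further in PySpark or pulled into local tools. Here's a minimal sketch, assuming the Hail 0.1 KeyTable conversions to_pandas and to_dataframe and a key column named 'gene' (from key_name); inspect sample_kt.schema for the actual fields.

# Pull the (sample, gene) scores into a local pandas DataFrame for analysis in Python or R.
scores_df = sample_kt.to_pandas()
print(scores_df.head())

# Or keep them distributed as a Spark DataFrame for analysis in PySpark.
scores_sdf = sample_kt.to_dataframe()
scores_sdf.groupBy('gene').count().show(5)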