For our first method of generating large sythetic datasets, Hail now provides a distributed implementation of the Balding-Nichols model. This model captures basic population structure as described in detail in the docs here.
As a simple example of data science, let’s generate a genetic dataset, filter to common variants, generate random sample covariates and a phenotype that also depends on the population label, run linear regression without and then with principal component covariates, and export the results:
from hail import *
hc = HailContext()
vds = (hc.balding_nichols_model(3, 1000, 1000, pop_dist = [0.2, 0.3, 0.5], fst = [.05, .09, .11])
.annotate_variants_expr('va.AC = gs.map(g => g.nNonRefAlleles).sum()')
.filter_variants_expr('va.AC >= 100 && va.AC <= 1900')
.annotate_samples_expr('sa.cov1 = rnorm(0, 1), sa.cov2 = rnorm(0, 1)')
.annotate_samples_expr('sa.pheno = rnorm(1, 1) + 2 * sa.cov1 - sa.cov2 + .2 * sa.pop')
.linreg('sa.pheno', covariates = ['sa.cov1', 'sa.cov2'], root = 'va.linreg')
.pca(scores = 'sa.pca', eigenvalues = 'global.evals', k = 3)
.linreg('sa.pheno', covariates = ['sa.cov1', 'sa.cov2', 'sa.pca.PC1', 'sa.pca.PC2'], root = 'va.linreg_with_PCs'))
vds.export_samples('pca.tsv', 'PC1 = sa.pca.PC1, PC2 = sa.pca.PC2, PC3 = sa.pca.PC3, pop = sa.pop')
vds.export_variants('pvals.tsv', 'pval = va.linreg.pval, pvals_with_PCs = va.linreg_with_PCs.pval')
Here is the resulting annotation schema, which includes all the parameters used in the Balding-Nichols model:
globals: Struct {
nPops: Int,
nSamples: Int,
nVariants: Int,
popDist: Array[Double],
Fst: Array[Double],
ancestralAFDist: Struct {
type: String,
minVal: Double,
maxVal: Double
},
seed: Int,
evals: Struct {
PC1: Double,
PC2: Double,
PC3: Double
}
}
sa: Struct {
pop: Int,
cov1: Double,
cov2: Double,
pheno: Double,
pca: Struct {
PC1: Double,
PC2: Double,
PC3: Double
}
}
va: Struct {
ancestralAF: Double,
AF: Array[Double],
AC: Int,
linreg: Struct {
beta: Double,
se: Double,
tstat: Double,
pval: Double
},
linreg_with_PCs: Struct {
beta: Double,
se: Double,
tstat: Double,
pval: Double
}
}
The three populations are revealed by projecting samples to the first two PCs:
Looking at the QQ-plot below, we see that linear regression without PCs is massively confounded due to population stratification, whereas including two PCs corrects for this confounding.