Generate synthetic datasets with the Balding-Nichols model

For our first method of generating large sythetic datasets, Hail now provides a distributed implementation of the Balding-Nichols model. This model captures basic population structure as described in detail in the docs here.

As a simple example of data science, let’s generate a genetic dataset, filter to common variants, generate random sample covariates and a phenotype that also depends on the population label, run linear regression without and then with principal component covariates, and export the results:

from hail import *
hc = HailContext()

vds = (hc.balding_nichols_model(3, 1000, 1000, pop_dist = [0.2, 0.3, 0.5], fst = [.05, .09, .11])
    .annotate_variants_expr('va.AC = gs.map(g => g.nNonRefAlleles).sum()') 
    .filter_variants_expr('va.AC >= 100 && va.AC <= 1900') 
    .annotate_samples_expr('sa.cov1 = rnorm(0, 1), sa.cov2 = rnorm(0, 1)') 
    .annotate_samples_expr('sa.pheno = rnorm(1, 1) + 2 * sa.cov1 - sa.cov2 + .2 * sa.pop') 
    .linreg('sa.pheno', covariates = ['sa.cov1', 'sa.cov2'], root = 'va.linreg') 
    .pca(scores = 'sa.pca', eigenvalues = 'global.evals', k = 3)
    .linreg('sa.pheno', covariates = ['sa.cov1', 'sa.cov2', 'sa.pca.PC1', 'sa.pca.PC2'], root = 'va.linreg_with_PCs'))

vds.export_samples('pca.tsv', 'PC1 = sa.pca.PC1, PC2 = sa.pca.PC2, PC3 = sa.pca.PC3, pop = sa.pop')
vds.export_variants('pvals.tsv', 'pval = va.linreg.pval, pvals_with_PCs = va.linreg_with_PCs.pval')

Here is the resulting annotation schema, which includes all the parameters used in the Balding-Nichols model:

globals: Struct {
    nPops: Int,
    nSamples: Int,
    nVariants: Int,
    popDist: Array[Double],
    Fst: Array[Double],
    ancestralAFDist: Struct {
        type: String,
        minVal: Double,
        maxVal: Double
    },
    seed: Int,
    evals: Struct {
        PC1: Double,
        PC2: Double,
        PC3: Double
    }
}
sa: Struct {
    pop: Int,
    cov1: Double,
    cov2: Double,
    pheno: Double,
    pca: Struct {
        PC1: Double,
        PC2: Double,
        PC3: Double
    }
}
va: Struct {
    ancestralAF: Double,
    AF: Array[Double],
    AC: Int,
    linreg: Struct {
        beta: Double,
        se: Double,
        tstat: Double,
        pval: Double
    },
    linreg_with_PCs: Struct {
        beta: Double,
        se: Double,
        tstat: Double,
        pval: Double
    }
}

The three populations are revealed by projecting samples to the first two PCs:

Looking at the QQ-plot below, we see that linear regression without PCs is massively confounded due to population stratification, whereas including two PCs corrects for this confounding.