Generate synthetic datasets with the Balding-Nichols model

johnc1231 · February 2, 2017, 7:49pm

For our first method of generating large sythetic datasets, Hail now provides a distributed implementation of the Balding-Nichols model. This model captures basic population structure as described in detail in the docs here.

As a simple example of data science, let’s generate a genetic dataset, filter to common variants, generate random sample covariates and a phenotype that also depends on the population label, run linear regression without and then with principal component covariates, and export the results:

from hail import *
hc = HailContext()

vds = (hc.balding_nichols_model(3, 1000, 1000, pop_dist = [0.2, 0.3, 0.5], fst = [.05, .09, .11])
    .annotate_variants_expr('va.AC = gs.map(g => g.nNonRefAlleles).sum()') 
    .filter_variants_expr('va.AC >= 100 && va.AC <= 1900') 
    .annotate_samples_expr('sa.cov1 = rnorm(0, 1), sa.cov2 = rnorm(0, 1)') 
    .annotate_samples_expr('sa.pheno = rnorm(1, 1) + 2 * sa.cov1 - sa.cov2 + .2 * sa.pop') 
    .linreg('sa.pheno', covariates = ['sa.cov1', 'sa.cov2'], root = 'va.linreg') 
    .pca(scores = 'sa.pca', eigenvalues = 'global.evals', k = 3)
    .linreg('sa.pheno', covariates = ['sa.cov1', 'sa.cov2', 'sa.pca.PC1', 'sa.pca.PC2'], root = 'va.linreg_with_PCs'))

vds.export_samples('pca.tsv', 'PC1 = sa.pca.PC1, PC2 = sa.pca.PC2, PC3 = sa.pca.PC3, pop = sa.pop')
vds.export_variants('pvals.tsv', 'pval = va.linreg.pval, pvals_with_PCs = va.linreg_with_PCs.pval')

Here is the resulting annotation schema, which includes all the parameters used in the Balding-Nichols model:

globals: Struct {
    nPops: Int,
    nSamples: Int,
    nVariants: Int,
    popDist: Array[Double],
    Fst: Array[Double],
    ancestralAFDist: Struct {
        type: String,
        minVal: Double,
        maxVal: Double
    },
    seed: Int,
    evals: Struct {
        PC1: Double,
        PC2: Double,
        PC3: Double
    }
}
sa: Struct {
    pop: Int,
    cov1: Double,
    cov2: Double,
    pheno: Double,
    pca: Struct {
        PC1: Double,
        PC2: Double,
        PC3: Double
    }
}
va: Struct {
    ancestralAF: Double,
    AF: Array[Double],
    AC: Int,
    linreg: Struct {
        beta: Double,
        se: Double,
        tstat: Double,
        pval: Double
    },
    linreg_with_PCs: Struct {
        beta: Double,
        se: Double,
        tstat: Double,
        pval: Double
    }
}

The three populations are revealed by projecting samples to the first two PCs:

Looking at the QQ-plot below, we see that linear regression without PCs is massively confounded due to population stratification, whereas including two PCs corrects for this confounding.

Topic		Replies	Views
PheWAS on DNAnexus UKB RAP Hail Query & hailctl	13	853	December 21, 2022
Fast linear regression for multiple phenotypes, eQTLs Updates	24	3201	July 17, 2018
Ancestry inference in Hail Help [0.1]	7	1855	February 28, 2018
GWAS with imputed data Hail Query & hailctl	11	990	March 10, 2020
[Breaking Change] Hail 0.2 removed KinshipMatrix, linear_mixed_regression Updates	17	1337	October 24, 2018

Generate synthetic datasets with the Balding-Nichols model

Related topics