[Experimental] Population Aware Relatedness Estimation

Hail 0.1 now has an experimental new method: VariantDataset.pc_relate which estimates relatedness of a dataset using the PC-Relate algorithm described in “Model-free Estimation of Recent Genetic Relatedness”. This improves upon our existing VariantDataset.ibd (PLINK-style) method, which produces population-biased estimates of kinship, identity-by-descent zero, identity-by-descent one, and identity-by-descent two.

We’re mostly confident the algorithm is implemented exactly as described in the paper; however, we have not reached parity with PC-Relate R implementation on all inputs. We hope that by releasing it publicly, we can learn more about how it performs on real datasets. You’ll see below that the estimator isn’t perfect on a Thousand Genomes dataset.

I created a public notebook containing an example of applying PC Relate to a QC’d Thousand Genomes dataset. After the pc_relate results, there is a cell showing biased estimates from ibd on the same dataset. The final cell of the notebook gives an example of a synthetic dataset with no related individuals but too many populations relative to the number of principal components supplied. I’ve copied the code, outputs, and plots below for your perusal.

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn

from hail import *
hc = HailContext()

vds = hc.read('gs://hail-jigold/ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.qc.vds')
vds.summarize()
Summary(samples=2316, variants=1642696, call_rate=0.997784, contigs=['12', '8', '19', '4', '15', '11', '9', '22', '13', '16', '5', '10', '21', '6', '1', '17', '14', '20', '2', '18', '7', '3'], multiallelics=0, snps=1642696, mnps=0, insertions=0, deletions=0, complex=0, star=0, max_alleles=2)

pruned = (vds
   .ld_prune(r2=0.1, memory_per_core=1024, num_cores=40)
   .persist())
pruned.summarize()
Summary(samples=2316, variants=402201, call_rate=0.997888, contigs=['12', '8', '19', '4', '15', '11', '9', '22', '13', '16', '5', '10', '21', '6', '1', '17', '14', '20', '2', '18', '7', '3'], multiallelics=0, snps=402201, mnps=0, insertions=0, deletions=0, complex=0, star=0, max_alleles=2)

pruned_subsample = pruned.sample_variants(0.2).persist()
pruned.unpersist()
pruned_subsample.export_plink('gs://danking/pc-relate-1kg-2017-08-11')
print(pruned_subsample.summarize())
kin = (pruned_subsample.pc_relate(20, 0.05, 512).persist())
df = kin.to_pandas()
df.describe()
Summary(samples=2316, variants=80682, call_rate=0.997887, contigs=['12', '8', '19', '4', '15', '11', '9', '22', '13', '16', '5', '10', '21', '6', '1', '17', '14', '20', '2', '18', '7', '3'], multiallelics=0, snps=80682, mnps=0, insertions=0, deletions=0, complex=0, star=0, max_alleles=2)

		kin				k0				k1				k2
count	3.870346e+06	3.870346e+06	3.870346e+06	3.870346e+06
mean	-2.599914e-04	1.001038e+00	-1.037799e-03	-1.676766e-07
std		5.603752e-03	2.379242e-02	2.752507e-02	8.101068e-03
min		-2.709853e-01	0.000000e+00	-1.076259e+00	-1.219171e-01
25%		-2.956884e-03	9.893576e-01	-1.652265e-02	-5.425395e-03
50%		-3.120027e-04	1.001258e+00	-1.285124e-03	-8.016450e-06
75%		2.320787e-03	1.013173e+00	1.395317e-02	5.413372e-03
max		2.879029e-01	2.070329e+00	1.086610e+00	2.900364e-01

import csv
with hadoop_read('gs://hail-1kg/parent-child-pairs.csv') as f:
    reader = csv.reader(f)
    parent_childs = set(map(lambda x: (x[0], x[1]), reader))

with hadoop_read('gs://hail-1kg/sibling-pairs.csv') as f:
    reader = csv.reader(f)
    siblings = set(map(lambda x: (x[0], x[1]), reader))

def choosecolor(row):
    if ((row['i'], row['j']) in parent_childs or (row['j'], row['i']) in parent_childs):
        return 'red'
    elif ((row['i'], row['j']) in siblings or (row['j'], row['i']) in siblings):
        return 'blue'
    else:
        return 'grey'

# this line is super slow
df['color'] = df.apply(lambda row: choosecolor(row), axis = 1)

Hail PC-Relate

Kinship vs IBD-0

NB: PC-Relate kinship ranges from 0-0.5, where 0.5 are monozygotic twins.

df_kink0_subsample = pd.concat([df[(df['kin'] > 0.05) | (df['k0'] < 0.9)], df[(df['kin'] <= 0.05) & (df['k0'] >= 0.9)].sample(frac=0.0001)])
df_kink0_subsample.plot(kind='scatter', x = 'k0', y = 'kin', alpha=0.25, figsize=(10,10), c=df_kink0_subsample['color'])

IBD-2 vs IBD-0

df_k0k2_subsample = pd.concat([df[(df['k2'] > 0.05) | (df['k0'] < 0.9)], df[(df['k2'] <= 0.05) & (df['k0'] >= 0.9)].sample(frac=0.01)])
df_k0k2_subsample.plot(kind='scatter', x = 'k0', y = 'k2', alpha=0.25, figsize=(10,10), c=df_k0k2_subsample['color'])

Hail’s PLINK-style IBD

Kinship vs IBD-0

NB: PLINK-style kinship ranges from 0.0-1.0, where 1.0 are monozygotic twins.

plink_kin = pruned_subsample.ibd()
plink_df = plink_kin.to_pandas()
plink_df = plink_df.rename(index=str, columns={'ibd.Z0': "k0", 'ibd.Z1' : 'k1', 'ibd.Z2' : 'k2', 'ibd.PI_HAT': 'kin'})

def kink0plot(df):
    df['color'] = df.apply(lambda row: choosecolor(row), axis = 1)
    df_kink0_subsample = pd.concat([df[(df['kin'] > 0.05) | (df['k0'] < 0.9)], df[(df['kin'] <= 0.05) & (df['k0'] >= 0.9)].sample(frac=0.0001)])
    df_kink0_subsample.plot(kind='scatter', x = 'k0', y = 'kin', alpha=0.25, figsize=(10,10), c=df_kink0_subsample['color'])

kink0plot(plink_df)

IBD-2 vs IBD-0

plink_df_k0k2_subsample = pd.concat([plink_df[(plink_df['k2'] > 0.05) | (plink_df['k0'] < 0.9)], plink_df[(plink_df['k2'] <= 0.05) & (plink_df['k0'] >= 0.9)].sample(frac=0.01)])
plink_df_k0k2_subsample.plot(kind='scatter', x = 'k0', y = 'k2', alpha=0.25, figsize=(10,10), c=plink_df_k0k2_subsample['color'])

4 Populations in Balding-Nichols

We use two principal components which cannot distinguish the four populations which the Balding-Nichols model places at the four points of a tetrahedron (a 3-simplex). Three principal coordinates would be sufficient to distinguish these populations.

bn4_vds = hc.balding_nichols_model(4, 500, 50000)
bn4_kin_df = (bn4_vds
    .pc_relate(2, 0.05, 512)
    .to_pandas())
bn4_kin_df.sample(frac=0.01).plot(kind='scatter', x = 'k0', y = 'kin', alpha=0.25, figsize=(15,15), c='grey')

bn4_vds = hc.balding_nichols_model(4, 500, 50000)
bn4_kin_df = (bn4_vds
    .pc_relate(3, 0.05, 512)
    .to_pandas())
bn4_kin_df.sample(frac=0.01).plot(kind='scatter', x = 'k0', y = 'kin', alpha=0.25, figsize=(15,15), c='grey')