Hi all, thanks for the awesome software.
I’m trying to get my head around the concordance method (and Hail in general).
I have a SNPchip VCF (not GVCF) with ~1,000 samples and ~450K loci, and I want
to get the concordance stats against a WGS MatrixTable generated with the Hail
GVCF combiner (after combining, say, 20 single-sample GVCFs of samples that
exist in the SNPchip VCF). Here’s the process I’ve followed based on trial and
error and going through the concordance docs:
hl.init(default_reference='GRCh38')
# liftover SNPchip VCF from GRCh37 to GRCh38 to match the WGS assembly
rg37 = hl.get_reference('GRCh37')
rg37.add_liftover('grch37_to_grch38.over.chain.gz', 'GRCh38')
snp_tmp = hl.import_vcf('snp.vcf.gz', reference_genome='GRCh37', force_bgz=True)
snp_tmp = snp_tmp.annotate_rows(new_locus=hl.liftover(snp_tmp.locus, 'GRCh38'))
snp_tmp = snp_tmp.filter_rows(hl.is_defined(snp_tmp.new_locus))
snp_tmp = snp_tmp.key_rows_by(locus=snp_tmp.new_locus, alleles=snp_tmp.alleles)
snp_tmp.write('snp.mt', overwrite=True)
# read mts
wgs = hl.read_matrix_table('wgs.mt')
snp = hl.read_matrix_table('snp.mt').key_rows_by('locus', 'alleles')
# densify and lgt_to_gt
wgs = hl.experimental.densify(wgs)
wgs = wgs.annotate_entries(GT=hl.experimental.lgt_to_gt(wgs.LGT, wgs.LA))
# split multiallelic and keep biallelic rows
wgs = hl.split_multi_hts(wgs)
wgs = wgs.filter_rows(hl.len(wgs.alleles) == 2)
# unphase genotypes
wgs = wgs.annotate_entries(
    GT=hl.case()
    .when(wgs.GT.is_diploid(), hl.call(wgs.GT[0], wgs.GT[1], phased=False))
    .when(wgs.GT.is_haploid(), hl.call(wgs.GT[0], phased=False))
    .default(hl.missing(hl.tcall))
)
# run concordance
global_conc, cols_conc, rows_conc = hl.concordance(snp, wgs)
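(For reference, I then write the per-sample and per-site tables out with something like the below; file names are just placeholders:)
cols_conc.export('concordance_samples.tsv')
rows_conc.export('concordance_variants.tsv')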
That works okay, as in I get a nice list of lists per sample in the cols_conc
file after exporting, e.g.:
s concordance n_discordant
T20 [[1053466,0,8416598,3186442,1783153],[50,0,141,45,35],[85872,0,242300,25,3],[542,0,42,100304,8],[362,0,23,120,53455]] 221
T21 [[951455,0,8450737,3247098,1790369],[57,0,127,35,37],[85626,0,242577,22,6],[544,0,40,100674,8],[375,0,19,106,53074]] 201
T22 [[945504,0,8521508,3165078,1807569],[27,0,99,53,40],[85640,0,242005,37,3],[532,0,53,101295,11],[371,0,17,221,52923]] 342
In the above, e.g. T20 had:
- 1,053,466 loci/rows missing in both the snp and wgs datasets (m[0][0])
- 85,872 homref in snp but missing in wgs (m[2][0])
- 242,300 homref in both (m[2][2])
- 42 hets in snp but homref in wgs (m[3][2])
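(In case it’s useful, this is how I’m pulling individual cells out of the samples Table. A small sketch: the annotation name is just something I made up, and it assumes the concordance / n_discordant fields returned by hl.concordance:)
# per-sample count of genotypes that are homref on the chip but absent from the wgs MT
cols_conc = cols_conc.annotate(homref_snp_missing_wgs=cols_conc.concordance[2][0])
cols_conc.select('homref_snp_missing_wgs', 'n_discordant').show()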
As you can see, I get a ton of missing data (e.g. the high number of homref
calls in snp that are missing in wgs is worrying). I’ve done some digging
(exploring the per-site concordance (rows_conc) for T20 against the original
GVCF), and it turns out these almost exclusively fall into reference blocks in
the original GVCF sample.
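(Roughly, the digging looked like this: filter both datasets down to T20, re-run concordance, and look at the per-site table. A sketch of the idea rather than the exact code:)
snp_t20 = snp.filter_cols(snp.s == 'T20')
wgs_t20 = wgs.filter_cols(wgs.s == 'T20')
_, _, rows_t20 = hl.concordance(snp_t20, wgs_t20)
# sites that are homref on the chip but have no data at all in the wgs MT
rows_t20.filter(rows_t20.concordance[2][0] > 0).show()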
Looking at the wgs MT, I suspect those ‘blocks’ get filtered out when filtering
for biallelic loci in the code above. densify also doesn’t expand all of those
blocks; to be honest, I don’t really understand under which circumstances the
expansion happens, e.g. how does
#CHROM POS ID REF ALT INFO FORMAT SAMPLE_1
chr1 14523 . C . END=15000 GT:DP:GQ 0/0:19:40
become
chr1 14523 0/0
chr1 14524 0/0
chr1 14525 0/0
chr1 14526 0/0
....
chr1 14999 0/0
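(To make that concrete, this is the kind of check I’ve been doing to see whether a chip site inside such a block ends up as a row after densify; the position below is made up from the example above:)
inside_block = hl.parse_locus('chr1:14600', reference_genome='GRCh38')
# after densify, is there actually a row at that position in the wgs MT?
print(wgs.filter_rows(wgs.locus == inside_block).count_rows())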
What’s a good way to get cleaner results? I originally thought of just
keeping/intersecting the wgs loci that appear in the snp data (i.e. filtering
the wgs data down to the ~450K SNPchip loci), but again, I don’t think densify
will take care of the block expansion (correct me if I’m wrong!). bedtools
intersect, for instance, handles the blocks nicely.
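(For concreteness, the intersection I had in mind is just a row semi-join, something like the line below, but I don’t think it recovers the chip sites that only live inside reference blocks:)
# keep only wgs rows whose (locus, alleles) key also appears in the snp MT
wgs_subset = wgs.semi_join_rows(snp.rows())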
(also posted to Zulip)