Hi all, thanks for the awesome software.
I’m trying to get my head around the concordance method (and Hail in general).
I have a SNPchip VCF (not GVCF) with ~1,000 samples and ~450K loci, and I want
to get the concordance stats against a WGS MatrixTable generated with the Hail
GVCF combiner (after combining, say, 20 single-sample GVCFs of samples that
exist in the SNPchip VCF). Here’s the process I’ve followed based on trial and
error and going through the concordance docs:
```python
hl.init(default_reference='GRCh38')

# liftover SNPchip VCF to GRCh38 to match the WGS assembly
snp_tmp = hl.import_vcf('snp.vcf.gz', reference_genome='GRCh37', force_bgz=True)
snp_tmp = liftover(snp_tmp, 'grch37_to_grch38.over.chain.gz')
snp_tmp.write('snp.mt', overwrite=True)

# read mts
wgs = hl.read_matrix_table('wgs.mt')
snp = hl.read_matrix_table('snp.mt').key_rows_by('locus', 'alleles')

# densify and lgt_to_gt
wgs = hl.experimental.densify(wgs)
wgs = wgs.annotate_entries(GT=hl.experimental.lgt_to_gt(wgs.LGT, wgs.LA))

# split multiallelics and keep biallelic rows
wgs = hl.split_multi_hts(wgs)
wgs = wgs.filter_rows(hl.len(wgs.alleles) == 2)

# unphase genotypes
wgs = wgs.annotate_entries(
    GT=hl.case()
    .when(wgs.GT.is_diploid(), hl.call(wgs.GT[0], wgs.GT[1], phased=False))
    .when(wgs.GT.is_haploid(), hl.call(wgs.GT[0], phased=False))
    .default(hl.missing(hl.tcall))
)

# run concordance
global_conc, cols_conc, rows_conc = hl.concordance(snp, wgs)
```
That works okay, as in I get a nice list of lists per sample in the
cols_conc file after exporting, e.g.:
```text
s    concordance                                                                                                            n_discordant
T20  [[1053466,0,8416598,3186442,1783153],[50,0,141,45,35],[85872,0,242300,25,3],[542,0,42,100304,8],[362,0,23,120,53455]]  221
T21  [[951455,0,8450737,3247098,1790369],[57,0,127,35,37],[85626,0,242577,22,6],[544,0,40,100674,8],[375,0,19,106,53074]]   201
T22  [[945504,0,8521508,3165078,1807569],[27,0,99,53,40],[85640,0,242005,37,3],[532,0,53,101295,11],[371,0,17,221,52923]]   342
```
In the above, e.g. T20 had:
- 1,053,466 loci/rows missing in both the snp and wgs datasets (m[0,0])
- 85,872 homref in snp, but missing in wgs (m[2,0])
- 242,300 homref in both (m[2,2])
- 42 hets in snp but homref in wgs (m[3,2])
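To sanity-check my reading of the matrix, here's a small pure-Python tally (no Hail needed) using the 5x5 layout as I understand it from the concordance docs: index 0 = no data, 1 = no call, 2 = homref, 3 = het, 4 = homvar, with rows = left dataset (snp) and columns = right dataset (wgs):

```python
# T20's concordance matrix copied from the export above.
# Assumed axis convention: rows = snp (left), columns = wgs (right);
# states: 0 = no data, 1 = no call, 2 = homref, 3 = het, 4 = homvar.
m = [
    [1053466, 0, 8416598, 3186442, 1783153],
    [50, 0, 141, 45, 35],
    [85872, 0, 242300, 25, 3],
    [542, 0, 42, 100304, 8],
    [362, 0, 23, 120, 53455],
]

# Both datasets have an actual genotype call only for states 2..4.
called = [(i, j) for i in (2, 3, 4) for j in (2, 3, 4)]
concordant = sum(m[i][j] for i, j in called if i == j)
discordant = sum(m[i][j] for i, j in called if i != j)

# snp called (homref/het/homvar) but no data at all in wgs -- the worrying cells
snp_called_wgs_missing = sum(m[i][0] for i in (2, 3, 4))

print(concordant)              # 396059
print(discordant)              # 221 -- matches T20's n_discordant column
print(snp_called_wgs_missing)  # 86776
```

The fact that the off-diagonal sum over called states reproduces T20's n_discordant of 221 is what makes me read the matrix this way.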
As you can see, I get a ton of missing data; e.g. the high number of homref
calls in snp that are missing in wgs is worrying. I’ve done some digging,
exploring the per-site concordance (rows_conc) for T20 against the
original GVCF, and it turns out these loci almost exclusively fall inside
reference blocks of the original GVCF sample.
Looking at the wgs MT, I suspect those ‘blocks’ get
filtered out when filtering for the biallelic loci in the code above, so densify
doesn’t expand all those blocks. To be honest, I don’t really understand under which
circumstances the expansion happens, e.g. how does a reference block like

```text
#CHROM  POS    ID  REF  ALT  INFO       FORMAT    SAMPLE_1
chr1    14523  .   C    .    END=15000  GT:DP:GQ  0/0:19:40
```

get expanded into per-position rows like the following?

```text
chr1  14523  0/0
chr1  14524  0/0
chr1  14525  0/0
chr1  14526  0/0
....
chr1  14999  0/0
```
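For what it's worth, my mental model of that expansion (not Hail's actual implementation, just how I picture it) is: a reference block at POS with INFO END=<end> stands for a 0/0 call at every position from POS to END, with END inclusive per the VCF spec:

```python
# My mental model of reference-block expansion (NOT Hail's implementation):
# a GVCF reference block at POS with INFO END=<end> represents a 0/0 call
# at every position in [POS, END] (END is inclusive in the VCF spec).
def expand_ref_block(chrom, pos, end, gt="0/0"):
    """Expand one reference block into per-position (chrom, pos, gt) rows."""
    return [(chrom, p, gt) for p in range(pos, end + 1)]

rows = expand_ref_block("chr1", 14523, 15000)
print(len(rows))   # 478 positions, 14523..15000 inclusive
print(rows[0])     # ('chr1', 14523, '0/0')
print(rows[-1])    # ('chr1', 15000, '0/0')
```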
What’s a good way to get cleaner results? I originally thought of just keeping/intersecting
the wgs loci that appear in the snp data (i.e. filtering the wgs data down to the ~450K snp
loci). But then again, densify should already take care of the block expansion (correct me
if I’m wrong!). bedtools intersect, for instance, handles the blocks nicely.
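To be concrete about the behaviour I'm after, here's a toy pure-Python sketch (illustrative only, not Hail or bedtools; all names made up): expand the reference blocks first, then keep only positions that exist on the chip, so a chip locus inside a block gets a homref call instead of coming out as missing:

```python
# Toy sketch of the intersection I'm after: expand reference blocks first,
# THEN filter to the chip's loci, so a chip locus that falls inside a block
# is recovered as a 0/0 call rather than "missing".
ref_blocks = [("chr1", 14523, 15000)]  # (chrom, POS, END) from the GVCF
snp_loci = {("chr1", 14530), ("chr1", 14999), ("chr1", 20000)}

wgs_calls = {}
for chrom, start, end in ref_blocks:
    for pos in range(start, end + 1):  # END is inclusive in VCF
        wgs_calls[(chrom, pos)] = "0/0"

intersected = {locus: wgs_calls[locus] for locus in snp_loci if locus in wgs_calls}
# chip loci inside the block are recovered as 0/0; chr1:20000 is truly absent
print(sorted(intersected))
```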
(also posted to Zulip)