Concordance with sparse MatrixTable

Hi all, thanks for the awesome software.
I’m trying to get my head around the concordance method (and Hail in general).
I have a SNPchip VCF (not GVCF) with ~1,000 samples and ~450K loci, and I want
to get the concordance stats against a WGS MatrixTable generated with the Hail
GVCF combiner (after combining, say, 20 single-sample GVCFs of samples that
exist in the SNPchip VCF). Here’s the process I’ve followed based on trial and
error and going through the concordance docs:


# liftover SNPchip VCF to GRCh38 to match the WGS assembly
snp_tmp = hl.import_vcf('snp.vcf.gz', reference_genome='GRCh37', force_bgz=True)
snp_tmp = liftover(snp_tmp, 'grch37_to_grch38.over.chain.gz')
snp_tmp.write('', overwrite=True)

# read mts
wgs = hl.read_matrix_table('')
snp = hl.read_matrix_table('').key_rows_by('locus', 'alleles')

# densify and lgt_to_gt
wgs = hl.experimental.densify(wgs)
wgs = wgs.annotate_entries(GT=hl.experimental.lgt_to_gt(wgs.LGT, wgs.LA))

# split multiallelic and keep biallelic rows
wgs = hl.split_multi_hts(wgs)
wgs = wgs.filter_rows(hl.len(wgs.alleles) == 2)

# unphase genotypes
wgs = wgs.annotate_entries(
    GT=hl.case()
        .when(wgs.GT.is_diploid(), hl.call(wgs.GT[0], wgs.GT[1], phased=False))
        .when(wgs.GT.is_haploid(), hl.call(wgs.GT[0], phased=False))
        .default(wgs.GT))

# run concordance
global_conc, cols_conc, rows_conc = hl.concordance(snp, wgs)

That works okay: I get a nice list of lists per sample in the exported cols_conc file, e.g.:

s   concordance                                                                                                           n_discordant
T20 [[1053466,0,8416598,3186442,1783153],[50,0,141,45,35],[85872,0,242300,25,3],[542,0,42,100304,8],[362,0,23,120,53455]] 221
T21 [[951455,0,8450737,3247098,1790369],[57,0,127,35,37],[85626,0,242577,22,6],[544,0,40,100674,8],[375,0,19,106,53074]]  201
T22 [[945504,0,8521508,3165078,1807569],[27,0,99,53,40],[85640,0,242005,37,3],[532,0,53,101295,11],[371,0,17,221,52923]]  342

In the above, e.g. T20 had:

  • 1,053,466 loci/rows missing in both the snp and wgs datasets (m[0][0])
  • 85,872 hom-ref in snp, but missing in wgs (m[2][0])
  • 242,300 hom-ref in both (m[2][2])
  • 42 hets in snp but hom-ref in wgs (m[3][2])
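The matrix above can be turned into summary numbers with plain Python (the helper name here is mine, not a Hail API; indices follow the hl.concordance docs: 0 = no data, 1 = no call, 2 = hom-ref, 3 = het, 4 = hom-var, rows indexing the left dataset and columns the right):

```python
# Summarize a 5x5 concordance matrix as returned per-sample by hl.concordance.
# concordance_stats is my own sketch, not part of Hail.
def concordance_stats(m):
    # cells where both datasets have a call: rows/cols 2..4
    called = sum(m[i][j] for i in range(2, 5) for j in range(2, 5))
    agree = sum(m[i][i] for i in range(2, 5))
    return agree / called, called - agree  # (concordance rate, n_discordant)

t20 = [[1053466, 0, 8416598, 3186442, 1783153],
       [50, 0, 141, 45, 35],
       [85872, 0, 242300, 25, 3],
       [542, 0, 42, 100304, 8],
       [362, 0, 23, 120, 53455]]
rate, n_disc = concordance_stats(t20)
print(f"{rate:.4%} concordant, {n_disc} discordant")
# n_disc matches the 221 reported for T20 in the table above
```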

As you can see, I get a ton of missing data, and the high number of hom-ref
calls in snp that are missing in wgs is particularly worrying. After digging
into the per-site concordance (rows_conc) for T20 and comparing against the
original GVCF, it turns out these sites almost exclusively fall inside
reference blocks of the original GVCF sample.

Looking at the wgs MT, I suspect those reference blocks get filtered out by the
biallelic-row filter in the code above. densify also doesn't expand all of
those blocks; to be honest, I don't really understand under which circumstances
the expansion happens. For example, how does

chr1    14523  .   C    .    END=15000  GT:DP:GQ  0/0:19:40

expand into per-position hom-ref calls like this?
chr1 14523 0/0
chr1 14524 0/0
chr1 14525 0/0
chr1 14526 0/0
...
chr1 14999 0/0
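As a sanity check on the numbers, here's a plain-Python sketch (not Hail) of what fully expanding that END-based block per position would produce, bedtools-style:

```python
# Plain-Python sketch (not Hail): expand an END-based GVCF reference block
# into one hom-ref record per position. END is treated as inclusive here,
# per the VCF spec.
def expand_ref_block(chrom, start, end, gt="0/0"):
    """Yield (chrom, pos, gt) for every position the block covers."""
    for pos in range(start, end + 1):
        yield (chrom, pos, gt)

rows = list(expand_ref_block("chr1", 14523, 15000))
print(rows[0], rows[-1], len(rows))
# ('chr1', 14523, '0/0') ('chr1', 15000, '0/0') 478
```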

What’s a good way to get cleaner results? I originally thought of just
keeping/intersecting the wgs loci that appear in the snp data (i.e. filtering
the wgs data down to the ~450K chip loci). But again, I don’t think densify
will take care of the block expansion (correct me if I’m wrong!). bedtools
intersect, for instance, handles the blocks nicely.

(also posted to Zulip)

This is a great question, but I don’t have a super easy answer right now. The core issue is that we need to build up more tooling around working with the sparse matrix tables – most of Hail is historically built around “dense” MTs imported from project VCFs, but sparse MTs created using the VCF combiner are the future.

I think these concordance results are what we expect, however. densify doesn’t actually add rows to the matrix table; instead, it scans forward over entry-level reference blocks so that the entries at each existing variant row pick up the reference information from blocks defined earlier. For instance, suppose the sparse MT has a row at chr1:15000 that defines some ref block starts, but no row at chr1:15001 (because no GVCF has a variant call or reference-block start at chr1:15001). If the SNP chip dataset has a site at chr1:15001, that site falls into the concordance category of “hom-ref in SNP, missing in WGS”.
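To illustrate that scan, here is a toy plain-Python model (my own sketch, not Hail's implementation or data layout): densify only fills entries at rows that already exist, carrying each earlier reference block forward, and never creates new rows.

```python
# Toy model of densify's forward scan (illustration only, not Hail code).
# Each sparse row is (pos, entry); an entry is a genotype call string,
# None (no entry), or a ("ref_block", end) block start.
def toy_densify(sparse_rows):
    dense = []
    block_end = -1  # end of the ref block currently in effect
    for pos, entry in sparse_rows:
        if isinstance(entry, tuple) and entry[0] == "ref_block":
            block_end = entry[1]
            dense.append((pos, "0/0"))  # block start becomes a hom-ref call
        elif entry is None and pos <= block_end:
            dense.append((pos, "0/0"))  # existing row covered by earlier block
        else:
            dense.append((pos, entry))  # variant call, or truly missing
    return dense

sparse_rows = [(14523, ("ref_block", 15000)),  # ref block chr1:14523-15000
               (14800, None),                  # existing row inside the block
               (15200, "0/1")]                 # variant call past the block
print(toy_densify(sparse_rows))
# [(14523, '0/0'), (14800, '0/0'), (15200, '0/1')]
# note: no row appears at e.g. 15001, so a chip site there stays "missing"
```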

The right solution here is a concordance method that works on sparse matrix tables directly: one that joins first and densifies afterwards.
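In toy terms (a plain-Python sketch of the ordering issue, not the Hail API), filtering the sparse rows to chip loci first drops the ref-block start rows, so nothing can fill in hom-ref calls afterwards; resolving ref blocks before restricting to chip sites keeps them:

```python
# Toy illustration of why the order matters (not Hail code).
# Sparse rows: pos -> entry; a ("ref_block", end) tuple marks a block start.
sparse = {14523: ("ref_block", 15000), 15200: "0/1"}
chip_sites = [14800, 15200]

# Wrong order: filter to chip sites first, then densify.
filtered = {p: e for p, e in sparse.items() if p in chip_sites}
# The block start at 14523 is gone, so 14800 can no longer be called hom-ref:
print(filtered.get(14800))  # None -> "hom-ref in SNP, missing in WGS"

# Right order: resolve ref blocks first (conceptually, densify), then filter.
def resolve(sparse, pos):
    entry = sparse.get(pos)
    if entry is not None and not (isinstance(entry, tuple) and entry[0] == "ref_block"):
        return entry  # a direct variant call at this position
    # otherwise, fall back to any ref block covering pos
    for start, e in sparse.items():
        if isinstance(e, tuple) and e[0] == "ref_block" and start <= pos <= e[1]:
            return "0/0"
    return None

print([resolve(sparse, p) for p in chip_sites])  # ['0/0', '0/1']
```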

I think I have a way we can do this; I’ll report back after some thought.


I gave it a shot. Is this along the lines of what you had in mind?