I’m trying to calculate LD for all variants within a dataset using ld_matrix() (see the script here). Within the dataset, there are multiallelic variants that have been split using split_multi_hts(), and I’m curious whether it’s appropriate to run ld_matrix() on variants that have been split. For example, is there any situation where two common, independent alleles might interfere with one another in the LD calculation?
Thanks in advance!
My first thought is that it would make more sense to include only biallelic variants, since ld_matrix() just computes the windowed pairwise correlation between variants.
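To make "pairwise correlation between variants" concrete, here is a minimal pure-Python sketch of the Pearson correlation between two variants' alt-allele counts (0, 1, or 2 per sample). The function name `pearson_r` and the toy genotype vectors are made up for illustration; this is not Hail's implementation, just the quantity I understand it to be computing for each pair of variants within the window.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("dosage vectors must be non-empty and the same length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A monomorphic variant has no defined correlation.
        return float("nan")
    return cov / sqrt(var_x * var_y)


# Hypothetical alt-allele counts for six samples at two biallelic variants.
variant_a = [0, 1, 2, 0, 1, 2]
variant_b = [0, 1, 2, 0, 1, 1]

r = pearson_r(variant_a, variant_b)
r_squared = r * r
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

Each row is treated as a single 0/1/2 count, which is why the calculation is only straightforward to interpret when each variant has exactly one alt allele.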
I was just looking around a bit, and on the PLINK 2.0 linkage disequilibrium page it mentions:
"Since two-variant r2 only makes sense for biallelic variants, these collapse multiallelic variants down to most common allele vs. the rest."
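As a rough illustration of what "most common allele vs. the rest" could mean in practice, the sketch below recodes multiallelic genotype calls into a single 0/1/2 count. This is only my reading of the sentence above, not PLINK's actual code; the function name `collapse_to_major_vs_rest`, the tie-breaking rule, and the toy calls are assumptions.

```python
from collections import Counter


def collapse_to_major_vs_rest(calls):
    """Recode diploid calls as counts of alleles that are NOT the most common one.

    `calls` is a list of (allele_index, allele_index) tuples, where 0 is the
    reference allele and 1, 2, ... are alt alleles.
    """
    counts = Counter(allele for call in calls for allele in call)
    # Assumption: ties are broken in favor of the lowest allele index.
    major = max(sorted(counts), key=lambda allele: counts[allele])
    dosages = [sum(1 for allele in call if allele != major) for call in calls]
    return major, dosages


# Hypothetical calls for six samples at a triallelic site.
calls = [(0, 0), (0, 1), (1, 2), (0, 2), (0, 0), (1, 1)]
major, dosages = collapse_to_major_vs_rest(calls)
print(major, dosages)
```

After this recoding the site behaves like a biallelic variant, so the usual two-variant correlation applies to it.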
This paper, "eLD: entropy-based linkage disequilibrium index between multiallelic sites", also makes it look like including multiallelic sites in LD calculations is a bit more involved, stating in the abstract:
"Commonly used LD indices such as r2 handle LD of biallelic variants for two sites."
That said, I’m not 100% sure here, so it may be worth running ld_matrix() twice, once on just the biallelic variants and once on the biallelic variants plus the split multiallelic variants, and taking a look at how the results differ.
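One thing I would expect that comparison to show: two rows split from the same multiallelic site are not independent, because a chromosome carrying one alt allele cannot carry the other. The sketch below assumes each split row counts copies of one alt allele and treats the other alt alleles as reference (my understanding of how split_multi_hts() downcodes genotypes). The helper names and the toy calls are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_row(calls, alt_index):
    """Count copies of one alt allele per sample; other alleles count as reference."""
    return [sum(1 for allele in call if allele == alt_index) for call in calls]


# Hypothetical calls for eight samples at one triallelic site (alleles 0, 1, 2).
calls = [(0, 0), (0, 1), (0, 1), (1, 1), (0, 2), (0, 2), (2, 2), (1, 2)]

row_alt1 = split_row(calls, 1)  # row for the first alt allele after splitting
row_alt2 = split_row(calls, 2)  # row for the second alt allele after splitting

r_between_split_rows = pearson_r(row_alt1, row_alt2)
print(f"r between the two split rows = {r_between_split_rows:.3f}")
```

In this toy data the correlation between the two split rows comes out negative, and that is purely a consequence of the alleles competing for the same position rather than a sign of linkage between distinct loci. That may be worth keeping in mind when interpreting entries of the LD matrix for pairs of rows that share a locus.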
Thanks @pwc2, those are great resources and a very helpful approach. I’ll test running ld_matrix() on biallelic variants only and then compare that to the results with multiallelic variants included.