I have two VCFs, both of which have had their multiallelic sites split. One VCF is my own data and the other is 1000 Genomes data. To make sure I do not have multiallelic sites, I would like to filter one VCF down to the positions that fall within the intervals of the other VCF.
For example, if vcf1 has a variant spanning chr1:1-5 and vcf2 has a variant at chr1:4, then I would like to keep the variant in vcf2 at chr1:4.
It’s something along the lines of mt1 = mt1.filter_rows(hl.is_defined(mt2[mt1.locus])) (possibly needs mt1.alleles too, depending on keys).
If I do that, I get the following error:
ExpressionException: Key type mismatch: cannot index matrix table with given expressions:
MatrixTable row key: locus<GRCh38>, array<str>
Index expressions: locus<GRCh38>
It works if I use the mt2[mt1.alleles] command. However, I wonder: if I import VCFs where the multiallelic sites have been split for both mt1 and mt2 and use mt2[mt1.alleles], will multiallelic sites be included in the indexed result?
For example, let’s say that at chromosome 1, position 2, the reference is G. mt1 has samples with alternate alleles C and T, while mt2 has only the alternate allele C. In this case, would the variant with alternate allele C be included in the indexed result?
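To make the example concrete, here is a plain-Python illustration (not Hail itself) of how matching on the full row key differs from matching on locus alone. It assumes both datasets were split so that every row carries exactly one alternate allele; the tuples and variable names are made up for this sketch.

```python
# Each split row is identified by (contig, position, ref, alt).
mt1_keys = [("chr1", 2, "G", "C"), ("chr1", 2, "G", "T")]
mt2_keys = {("chr1", 2, "G", "C")}

# Matching on the full key (locus + alleles): only identical variants survive.
exact = [k for k in mt1_keys if k in mt2_keys]

# Matching on locus only: every mt1 row at a position present in mt2 survives.
mt2_loci = {(contig, pos) for contig, pos, _, _ in mt2_keys}
by_locus = [k for k in mt1_keys if (k[0], k[1]) in mt2_loci]

print(exact)     # [('chr1', 2, 'G', 'C')]
print(by_locus)  # [('chr1', 2, 'G', 'C'), ('chr1', 2, 'G', 'T')]
```

Under that assumption, a join on locus and alleles would keep the G/C row and drop the G/T row, whereas a locus-only join would keep both.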
If you have not already solved it, would mt.filter_rows(hl.len(mt.alleles) == 2) be a solution? Alternatively, it seems one of the VCFs for some reason did not get keyed by alleles, causing the mismatch. You could either add or drop that from the index with key_by.