Filter variants based on other files

I have a Hail matrix table with variants and samples (h1) and a text file derived from a ClinVar VCF. I would like to filter out the variants (rows) that are not in the ClinVar text file, but I am not sure how.

The matrix table is keyed by locus (chr:pos) and the array of alleles. I successfully imported the text file as a Hail table (no key) and tried to annotate h1 with this table, then filter on the new field.

However, I don't know how to key the text-file table the same way as h1 so that I can annotate and then filter. Or am I thinking about this completely wrong?

Sorry for the basic question; I started using Hail only recently and would greatly appreciate any ideas.

Thank you!

How are the variants formatted in the clinvar table?

It will look something like:

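# first annotate the clinvar table with `locus` and `alleles` fields,
# then key the table by them so its key matches the matrix table's row key
clinvar = clinvar.key_by('locus', 'alleles')

# semi_join_rows keeps only the rows whose keys are also present in the table
# (mt is your matrix table, h1 in the question)
mt = mt.semi_join_rows(clinvar)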

These four are the columns that should be used for the key, and the h1 key looks like:
chr1:17018956 ["A","T"]

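The missing bit can be:

# build the `locus` and `alleles` fields that the key_by step needs
clinvar = clinvar.annotate(
    # reference_genome must match the one used by the matrix table
    locus = hl.locus(clinvar.Chromosome, clinvar.PositionVCF, reference_genome='GRCh37'),
    alleles = [clinvar.Reference, clinvar.AlternateAlleleVCF]
)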

I should say: if you have the ClinVar VCF itself, just importing that would make this easier!
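
For the VCF route, a rough sketch might look like the following. It is only an illustration under some assumptions: the function names and the file path are made up for the example, and it guesses that h1 is on GRCh38 because its key prints as chr1:..., while the ClinVar VCF names its contigs without the "chr" prefix. If your matrix table is on GRCh37, drop the contig recoding and change the reference genome.

```python
def clinvar_contig_recoding():
    """Map un-prefixed contig names ('1', 'X', 'MT') to 'chr'-prefixed GRCh38 names."""
    recoding = {str(i): f"chr{i}" for i in range(1, 23)}
    recoding.update({"X": "chrX", "Y": "chrY", "MT": "chrM"})
    return recoding


def keep_clinvar_variants(mt, clinvar_vcf_path):
    """Keep only the rows of `mt` whose (locus, alleles) key appears in the ClinVar VCF."""
    import hail as hl  # imported here so the helper above can be used without Hail

    clinvar = hl.import_vcf(
        clinvar_vcf_path,            # hypothetical path to a bgzipped ClinVar VCF
        reference_genome="GRCh38",   # assumption: must match the genome of `mt`
        contig_recoding=clinvar_contig_recoding(),
        skip_invalid_loci=True,      # skip contigs that are not in the reference
        force_bgz=True,              # treat the .vcf.gz file as block-gzipped
    )
    # rows() is a table already keyed by locus and alleles, so no key_by is needed
    return mt.semi_join_rows(clinvar.rows())
```

You would then call something like h1 = keep_clinvar_variants(h1, 'clinvar.vcf.gz'). Because the imported VCF is already keyed by locus and alleles, the annotate and key_by steps from the text-file approach are not needed.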