Help reformatting variant ID string to annotate MatrixTable keyed by locus, alleles

Hello! I’m new to Hail so apologies if this question has been asked before or is a bit basic.

I am trying to annotate a MatrixTable of genomic data (mt), keyed by [locus, alleles], with variant annotations from another Hail table (vat_table). Ultimately, I would like to filter the variants in mt by the variant type recorded in vat_table.variant_type.

However, the fields in vat_table don't seem to be well formatted for parsing as loci. The variant ID field (vat_table.vid) is a string formatted as '1-10001-T-C', which hl.parse_variant() does not recognize. I can use the replace('-', ':') method to change the format, but then calling:

hl.parse_variant(vat_table.vid, reference_genome='GRCh38')

throws an error, since GRCh38 expects the contig in the form 'chr1'. One solution would be to prepend 'chr' to each vat_table.vid string, but I am having trouble figuring out how to do this in Hail.

Another option would be to concatenate four string fields, vat_table.contig (ex: 'chr1'), vat_table.position (ex: '10001'), vat_table.ref_allele (ex: 'T'), and vat_table.alt_allele (ex: 'C'), to form a DIY vid field. However, I am having trouble doing this as well.

A final option would be to change the format of the vat_table.vid field when I originally load the table with the call below:

vat_table = hl.import_table(vat_path, force=True, quote='"', delimiter='\t', force_bgz=True)

I see this thread advising on how to adjust the contig format when reading in a VCF, but I'm not sure if there is a nice way to do the same for a tab-delimited file. Unfortunately, I am not able to make any changes directly in the source file.

Hopefully that is clear, and thanks so much for your help!


Hi,

If you’d like to go with the first option, you could do something along these lines:

vat_table = hl.import_table(...)
vat_table = vat_table.annotate(vid = "chr" + vat_table.vid.replace("-", ":"))
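The reformatted string still has to be parsed into a locus and alleles before vat_table can be matched against mt. A minimal sketch of that step follows. The function names reformat_vid and key_by_parsed_vid are made up for illustration, and it assumes that every contig in vid becomes a valid GRCh38 contig name once 'chr' is prepended (this may not hold for contigs such as the mitochondrial one).

```python
def reformat_vid(vid: str) -> str:
    """Plain-Python mirror of the Hail string expression used below.

    Useful for spot-checking a few IDs by hand; not used by Hail itself.
    """
    return "chr" + vid.replace("-", ":")


def key_by_parsed_vid(vat_table):
    """Return vat_table keyed by (locus, alleles) parsed from its vid field.

    Assumes vat_table is a Hail Table with a string field `vid`
    formatted like '1-10001-T-C'.
    """
    import hail as hl  # imported here so reformat_vid works without Hail installed

    # Same transformation as reformat_vid, but as a lazy Hail expression.
    variant_str = "chr" + vat_table.vid.replace("-", ":")

    # parse_variant returns a struct with the fields `locus` and `alleles`.
    parsed = hl.parse_variant(variant_str, reference_genome="GRCh38")

    return vat_table.key_by(locus=parsed.locus, alleles=parsed.alleles)
```

After vat_table = key_by_parsed_vid(vat_table), the table's key has the same types as the row key of a GRCh38 MatrixTable, which is what a row join needs.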

For the second,

vat_table = vat_table.annotate(vid = vat_table.contig + ":" + vat_table.position + ":" + vat_table.ref_allele + ":" + vat_table.alt_allele)
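Since the separate columns already exist, it may be simpler to skip the string round trip and build the locus and alleles directly, then join onto mt. This is a rough sketch rather than a tested recipe: the function names are hypothetical, it assumes position was imported as a string (the import_table default), and it assumes vat_table has one row per variant. If a variant appears on several rows, I believe the join keeps just one of them.

```python
def locus_and_alleles(contig: str, position: str, ref: str, alt: str):
    """Plain-Python mirror of the per-row values built in Hail below."""
    return contig, int(position), [ref, alt]


def annotate_variant_type(mt, vat_table):
    """Annotate mt rows with variant_type looked up from vat_table.

    Assumes mt is keyed by (locus, alleles) on GRCh38 and vat_table has the
    string fields contig, position, ref_allele, alt_allele and variant_type.
    """
    import hail as hl  # imported here so locus_and_alleles works without Hail

    keyed = vat_table.key_by(
        # hl.locus needs an int32 position, so convert the imported string.
        locus=hl.locus(
            vat_table.contig,
            hl.int32(vat_table.position),
            reference_genome="GRCh38",
        ),
        alleles=hl.array([vat_table.ref_allele, vat_table.alt_allele]),
    )

    # Row join: look up each mt row key in the keyed table.
    return mt.annotate_rows(variant_type=keyed[mt.row_key].variant_type)


# Hypothetical usage ("snv" is a placeholder value, not taken from the data):
# mt = annotate_variant_type(mt, vat_table)
# mt = mt.filter_rows(mt.variant_type == "snv")
```

Rows of mt with no match in vat_table get a missing variant_type, and a filter_rows comparison on a missing value drops those rows.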

As far as I can tell, we unfortunately don’t have an equivalent to the contig_recoding keyword for import_table, but someone else from the team may be able to chime in if I’m missing something there.

Hope that helps!
