Help reformatting variant ID string to annotate MatrixTable keyed by locus, alleles

Hello! I’m new to Hail so apologies if this question has been asked before or is a bit basic.

I am trying to annotate a MatrixTable of genomic data (mt), keyed by [locus, alleles], with variant annotations from another Hail table (vat_table). Ultimately, I would like to filter the variants in mt by their corresponding variant type, found in vat_table.variant_type.

However, the fields in vat_table don’t seem to be well formatted for parsing as loci. The variant ID field (vat_table.vid) is a str formatted as '1-10001-T-C', which hl.parse_variant() does not recognize. I can use the replace('-', ':') method to change the format, but then calling:

hl.parse_variant(vat_table.vid, reference_genome='GRCh38')

throws an error, since the GRCh38 contig needs to be in the form 'chr1'. One solution would be to prepend 'chr' to each vat_table.vid string, but I am having trouble figuring out how to do this in Hail.

Another option would be to concatenate four string fields, vat_table.contig (e.g. 'chr1'), vat_table.position (e.g. '10001'), vat_table.ref_allele (e.g. 'T'), and vat_table.alt_allele (e.g. 'C'), to form a DIY vid field. However, I am having trouble doing this as well.

A final option would be to change the format of the vat_table.vid field when I originally load the table with the call below:

vat_table = hl.import_table(vat_path, force=True, quote='"', delimiter="\t", force_bgz=True)

I see this thread advising on how to adjust the contig format when reading in a VCF, but I’m not sure if there is a nice way to adjust this for a tab-delimited file. I am unfortunately not able to make any changes directly in the source file.

Hopefully that is clear, and thanks so much for your help!



If you’d like to go with the first option, you could do something along these lines:

vat_table = hl.import_table(...)
vat_table = vat_table.annotate(vid = "chr" + vat_table.vid.replace("-", ":"))

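For the second, since import_table reads every field as a string unless you ask it to impute or specify types, the four fields can be concatenated directly: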

vat_table = vat_table.annotate(vid = vat_table.contig + ":" + vat_table.position + ":" + vat_table.ref_allele + ":" + vat_table.alt_allele)
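Either way, vid is still a plain string at this point, so it needs to be parsed and used as the table key before it can be joined to mt. Here is a rough sketch of how the remaining steps might look, assuming vat_table has the vid and variant_type fields described in the question. The function names (reformat_vid, key_vat_by_variant, filter_by_variant_type) are made up for illustration and are not part of Hail; I have not run this against your data.

```python
def reformat_vid(vid: str) -> str:
    """Plain-Python mirror of the string rewrite, handy for spot-checking a few IDs."""
    contig, position, ref, alt = vid.split("-")
    return f"chr{contig}:{position}:{ref}:{alt}"


def key_vat_by_variant(vat_table, reference_genome="GRCh38"):
    """Key a Table with a '1-10001-T-C'-style `vid` field by (locus, alleles)."""
    import hail as hl  # imported here so reformat_vid can be used without Hail

    # Same rewrite as reformat_vid, expressed on a Hail string expression.
    variant_str = "chr" + vat_table.vid.replace("-", ":")
    # parse_variant returns a struct with `locus` and `alleles` fields.
    parsed = hl.parse_variant(variant_str, reference_genome=reference_genome)
    return vat_table.key_by(locus=parsed.locus, alleles=parsed.alleles)


def filter_by_variant_type(mt, vat_table, wanted_type):
    """Annotate MatrixTable rows with `variant_type`, then keep only `wanted_type`."""
    vat_keyed = key_vat_by_variant(vat_table)
    # Indexing a keyed Table with the MatrixTable row key performs the join.
    mt = mt.annotate_rows(variant_type=vat_keyed[mt.row_key].variant_type)
    return mt.filter_rows(mt.variant_type == wanted_type)
```

You would then call something like filter_by_variant_type(mt, vat_table, "snv"), where "snv" is only a placeholder for whatever values your variant_type column actually contains. If you take the second route, it should also be possible to skip the string round trip and build the key directly with hl.locus(vat_table.contig, hl.int32(vat_table.position), reference_genome='GRCh38') plus an array of the ref and alt alleles.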

As far as I can tell, we unfortunately don’t have an equivalent to the contig_recoding keyword for import_table, but someone else from the team may be able to chime in if I’m missing something there.

Hope that helps!