Help reformatting variant ID string to annotate MatrixTable keyed by locus, allele

jbs · October 6, 2023, 10:29pm

Hello! I’m new to Hail so apologies if this question has been asked before or is a bit basic.

I am trying to annotate a MatrixTable of genomic data (mt), keyed by [locus, alleles], with variant annotations from another Hail table (vat_table). Specifically, I would eventually like to filter the variants in mt by the corresponding variant type in vat_table.variant_type.

However, the fields in vat_table don’t seem to be well formatted for parsing as loci. The variant ID field (vat_table.vid) is formatted as '1-10001-T-C' (str), which is not recognized by hl.parse_variant(). I can use the replace('-', ':') method to change the format, but then calling:

hl.parse_variant(vat_table.vid, reference_genome='GRCh38')

throws an error, since the contig needs to be in the form 'chr1'. One solution would be to prepend 'chr' to each of the vat_table.vid strings, but I am having trouble figuring out how to do this in Hail.

Another option would be to concatenate four string fields, vat_table.contig (e.g. 'chr1'), vat_table.position (e.g. '10001'), vat_table.ref_allele (e.g. 'T'), and vat_table.alt_allele (e.g. 'C'), to form a DIY vid field. However, I am having trouble doing this as well.

A final option would be to change the format of the vat_table.vid field when I originally load the table with the call below:

vat_table = hl.import_table(vat_path, force=True, quote='"', delimiter="\t", force_bgz=True)

I see this thread advising on how to adjust the contig format when reading in a VCF, but I’m not sure if there is a nice way to adjust this for a tab-delimited file. Unfortunately, I am not able to make any changes directly in the source file.

Hopefully that is clear, and thanks so much for your help!

iris-garden · October 12, 2023, 4:41pm

Hi,

If you’d like to go with the first option, you could do something along these lines:

vat_table = hl.import_table(...)
vat_table = vat_table.annotate(vid = "chr" + vat_table.vid.replace("-", ":"))
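As a quick sanity check on the target format, here is a minimal plain-Python sketch of what that expression does to a single value. The helper name to_hail_variant_string is made up for illustration and is not part of Hail; the Hail expression above applies the same replacement to every row.

```python
def to_hail_variant_string(vid: str) -> str:
    """Mirror of the Hail expression above for one plain Python string.

    Hypothetical helper for illustration only, not a Hail function.
    """
    # Same two steps as the Hail expression: swap '-' for ':' and prepend 'chr'.
    variant = "chr" + vid.replace("-", ":")

    # hl.parse_variant expects four ':'-separated parts: contig, position, ref, alt.
    parts = variant.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected contig:position:ref:alt, got {variant!r}")
    return variant


print(to_hail_variant_string("1-10001-T-C"))  # chr1:10001:T:C
```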

For the second,

vat_table = vat_table.annotate(vid = vat_table.contig + ":" + vat_table.position + ":" + vat_table.ref_allele + ":" + vat_table.alt_allele)
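Once vid is in the 'chr1:10001:T:C' form, the rest of the workflow from the question (key the table like mt, join variant_type, filter) might look roughly like the sketch below. It uses the first option for the string, and assumes that vat_table has at most one row per variant and that every field came in as a string, which is the import_table default. The function name and the wanted_type argument are placeholders, not anything from Hail or the original post.

```python
def annotate_and_filter_by_variant_type(mt, vat_table, wanted_type):
    """Join vat_table.variant_type onto the rows of mt and keep one type.

    Hypothetical helper; `wanted_type` is whatever value appears in
    vat_table.variant_type for the variants of interest.
    """
    # Imported inside the function so the helper can be defined before Hail is set up.
    import hail as hl

    # First option from above: '1-10001-T-C' becomes 'chr1:10001:T:C'.
    variant_str = "chr" + vat_table.vid.replace("-", ":")

    # parse_variant yields a struct with `locus` and `alleles`; keying by it
    # gives vat_table the same key as mt.
    vat_keyed = vat_table.key_by(
        **hl.parse_variant(variant_str, reference_genome="GRCh38")
    )

    # Assumption: at most one vat_table row per variant.
    mt = mt.annotate_rows(variant_type=vat_keyed[mt.row_key].variant_type)

    # Rows whose variant_type is missing or different are dropped.
    return mt.filter_rows(mt.variant_type == wanted_type)
```

For the second option, it should also be possible to skip the string round trip and key directly with hl.locus(vat_table.contig, hl.int32(vat_table.position), reference_genome='GRCh38') together with [vat_table.ref_allele, vat_table.alt_allele]. If position was imported as an integer instead (for example with impute=True), it would need wrapping in hl.str() before being concatenated with the other strings.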

As far as I can tell, we unfortunately don’t have an equivalent to the contig_recoding keyword for import_table, but someone else from the team may be able to chime in if I’m missing something there.

Hope that helps!
