Annotation text file

I’m trying to get an Ensembl gene annotation text file for genome version GRCh38.
I need a human dataset from Ensembl GRCh38 that works with Hail.

I have a BioMart version, but I keep getting errors like this:
Hail version: 0.2.49-11ae8408bad0
Error summary: HailException: Invalid interval ‘[Y:2784749-Y:2784853)’ found. Contig ‘Y’ is not in the reference genome ‘GRCh38’.

This is the error I got after running this code:
gene_ht = gene_ht.transmute(interval = hl.locus_interval(gene_ht.chromosome,
gene_ht.end_position, reference_genome="GRCh38"))

I’m not in the US region, so the annotation DB was not much help for me.
Thank you so much in advance!

Hi @Shiri.Margalit, what region are you in? Are you using GCP or AWS?

You might try this:

contig_recoding = {str(i): 'chr' + str(i) for i in range(1, 23)}
contig_recoding.update({'X': 'chrX', 'Y': 'chrY', 'MT': 'chrM'})
hl_contig_recoding = hl.literal(contig_recoding)

gene_ht = gene_ht.annotate(chromosome = hl_contig_recoding[gene_ht.chromosome])

before the transmute.

Thank you for the fast reply!
It worked, but only partly. I think that’s because the BioMart annotation also has chromosome names such as KI270442.1, CHR_HSCHRX_2_CTG3, and more.
Do I need to remove those and keep only the following chromosomes: 1-22, X, Y, MT?

This is the error I’m getting:
Hail version: 0.2.49-11ae8408bad0
Error summary: HailException: Key ‘KI270442.1’ not found in dictionary. Keys: [“1”,“10”,“11”,“12”,“13”,“14”,“15”,“16”,“17”,“18”,“19”,“2”,“20”,“21”,“22”,“3”,“4”,“5”,“6”,“7”,“8”,“9”,“MT”,“X”,“Y”]

And I’m from Israel, working on a private university storage service (not AWS or GCP).

Thank you,

Yes, I would filter out all the non-standard contigs. You might find hl.is_valid_locus useful.
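
To make the filtering step concrete, here is a minimal plain-Python sketch of the logic, not Hail code: drop any row whose chromosome has no entry in the recoding dictionary, then recode the rest. The rows and the keep_and_recode helper are hypothetical stand-ins for the BioMart table; only the contig names come from this thread.

```python
# Ensembl/BioMart contig names mapped to 'chr'-prefixed names,
# built the same way as the dictionary suggested earlier in the thread.
CONTIG_RECODING = {str(i): "chr" + str(i) for i in range(1, 23)}
CONTIG_RECODING.update({"X": "chrX", "Y": "chrY", "MT": "chrM"})


def keep_and_recode(rows):
    """Drop rows on non-standard contigs and recode the remaining ones.

    `rows` is a hypothetical list of dicts with a 'chromosome' key,
    standing in for the BioMart table.
    """
    recoded = []
    for row in rows:
        new_name = CONTIG_RECODING.get(row["chromosome"])
        if new_name is None:
            # Scaffolds and patches such as KI270442.1 have no entry: skip them.
            continue
        recoded.append({**row, "chromosome": new_name})
    return recoded


# Made-up rows, for illustration only.
example_rows = [
    {"gene": "gene_a", "chromosome": "Y"},
    {"gene": "gene_b", "chromosome": "KI270442.1"},
    {"gene": "gene_c", "chromosome": "CHR_HSCHRX_2_CTG3"},
    {"gene": "gene_d", "chromosome": "MT"},
]
print(keep_and_recode(example_rows))
```

In Hail, the same idea could presumably be expressed by filtering before the annotate step, for example with gene_ht.filter(hl_contig_recoding.contains(gene_ht.chromosome)). That line is untested against this table, so treat it as a starting point.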

It worked!
Thank you :slight_smile:
