Reference genome

I see this error when I use default_reference='GRCh38':

Hail version: 0.2.79-f141af259254
Error summary: HailException: Invalid locus 'chr1:249066372' found. Position '249066372' is not within the range [1-248956422] for reference genome 'GRCh38'.

The VCFs are from an experiment done using the Illumina Nextera Rapid Capture Expanded Exome Kit.

As I see it, the error message is pretty on point. Chromosome 1 has a length of 248,956,422 and your locus is ~110k basepairs beyond that. Might be a sequencing error of sorts?

@jsmadsen is absolutely on point. @Manimala, I suspect your data is actually aligned to GRCh37, in which chromosome 1 is 249,250,621 base pairs long.
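To make that reasoning concrete, here is a minimal sketch in plain Python (no Hail needed) that checks which build's chromosome 1 is long enough to hold the offending position. The two lengths are the ones quoted in this thread; the helper name builds_containing is made up for illustration. A single locus can only rule a build out, it cannot prove which build the data was aligned to.

```python
# Chromosome 1 lengths as quoted in this thread.
CHR1_LENGTH = {"GRCh37": 249_250_621, "GRCh38": 248_956_422}


def builds_containing(position, lengths=CHR1_LENGTH):
    """Return the builds whose chromosome 1 contains the 1-based position.

    Hypothetical helper for illustration; not part of Hail.
    """
    return [name for name, length in lengths.items() if 1 <= position <= length]


# The position from the error message fits GRCh37 but not GRCh38.
candidates = builds_containing(249_066_372)
print(candidates)  # ['GRCh37']
```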

So when I use GRCh37 as the default genome, I see this error instead: Contig 'chrM' is not in the reference genome 'GRCh37'

I am guessing I have to create a custom ref genome.

Let me know how to do that

@Manimala, ah, sorry to hear you're still having trouble. In GRCh37, the mitochondrial contig is called "MT", with no "chr" prefix and with a trailing "T". Our source of truth for chromosome names in GRCh37 is: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/human_g1k_v37.dict.

We quite often encounter VCFs whose contigs use non-standard names, so we provide the contig_recoding parameter to import_vcf. It seems like you need a contig recoding like this:

mt = hl.import_vcf(
    ...,
    contig_recoding={**{f'chr{x}': str(x) for x in range(1, 23)},
                     'chrX': 'X',
                     'chrY': 'Y',
                     'chrM': 'MT'}
)
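
If you want to sanity-check the mapping before running the import, a rough sketch in plain Python follows. It rebuilds the same dictionary and applies it to a few contig names. The vcf_contigs list is a made-up example (substitute the names from your own VCF header), recode is a hypothetical helper, and GRCH37_PRIMARY only covers the primary GRCh37 contigs (1-22, X, Y, MT), not the unplaced GL contigs.

```python
# Same mapping as in the import_vcf call above.
contig_recoding = {
    **{f"chr{x}": str(x) for x in range(1, 23)},
    "chrX": "X",
    "chrY": "Y",
    "chrM": "MT",
}

# Primary GRCh37 contig names only (assumption: unplaced GL contigs are ignored).
GRCH37_PRIMARY = {str(x) for x in range(1, 23)} | {"X", "Y", "MT"}


def recode(contig, recoding=contig_recoding):
    """Return the recoded contig name, or the name unchanged if it is not mapped."""
    return recoding.get(contig, contig)


# Hypothetical contig names; replace with the ones in your VCF header.
vcf_contigs = ["chr1", "chr22", "chrX", "chrM"]
recoded = [recode(c) for c in vcf_contigs]

# Anything listed here would still be rejected after recoding.
unknown = [c for c in recoded if c not in GRCH37_PRIMARY]

print(recoded)  # ['1', '22', 'X', 'MT']
print(unknown)  # []
```

Any name that ends up in unknown needs its own entry in contig_recoding, or has to be filtered out before import.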