Fail to read a T2T VCF

Trying to import one of the VCFs found here

The risk was high – but the closest reference available is GRCh38

h.init(default_reference='GRCh38')
h.import_vcf("chr17.genotyped.vcf.gz", force_bgz=True).write("t2t17.mt")
Error summary: HailException: Invalid locus 'chr17:83402572' found. Position '83402572' is not within the range [1-83257441] for reference genome 'GRCh38'.
[Stage 0:>   

There’s probably a workaround defining the allowable ranges? But I could not find it and I think the doc on reference genome identification and modification could be enhanced to directly address the use case: I need to specify a complete new reference. Or I missed it. I see how to create a ReferenceGenome instance from JSON but I don’t see how to impose it in the hail instance.

this should work:

rg = hl.ReferenceGenome.read(json_file)
hl.import_vcf("chr17.genotyped.vcf.gz", force_bgz=True, reference_genome=rg)

I don’t think you can set the default RG to a custom one – this requires a slight redesign of the client session model and isn’t super high priority right now.

Should only need to do this once, though, since reading a table/matrixtable with a custom reference will load that reference into the session. In the future you can read that MT and do rg = mt.locus.dtype.reference_genome if you need to use it elsewhere.

Should only need to do this once, though, since reading a table/matrixtable with a custom reference will load that reference into the session. In the future you can read that MT and do rg = mt.locus.dtype.reference_genome if you need to use it elsewhere.

Thanks so much for such rapid response! Before I go after this, is it sufficient to define the reference genome in json for the single contig of interest, or do I need to define the whole genome?

Defining a RG with a single contig will work if your VCF only has data on that chromosome, but if you later need to work with multiple contigs, you’ll need to either reimport or define a liftover mapping from your chr17-only-rg to a more complete reference genome.

Success! It required some conniving though, because the VCF had more positions than NCBI claims for the contig. When I get a proper reference genome spec I will post it here for posterity.

1 Like

To bring this to closure this json file gives chromosome lengths for autosomes of the T2T reference, and can be used with ReferenceGenome.read to work with VCFs created with this reference.

2 Likes

Thanks for sharing!