Hi all! I know similar issues have been posted a few times, but I do not think any of the existing solutions apply here. I am trying to import a gnomAD VCF (gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz) with a simple chunk of code:
recode = {f"{i}":f"chr{i}" for i in (list(range(1, 23)) + ['X', 'Y'])}
mt = hl.import_vcf("./gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz", force_bgz=True, reference_genome='GRCh38', contig_recoding=recode)
mt.write('gnomad.mt', overwrite = True)
But I ran into this issue:
Hail version: 0.2.108-fc03e9d5dc08
Error summary: ClassTooLargeException: Class too large: __C30721collect_distributed_array_matrix_native_writer
I am not sure how to resolve this. Oddly, when I ran the same code on the non-liftover VCF (gnomad.exomes.r2.1.1.sites.vcf.bgz), it worked fine. I would appreciate any input on this matter!
Hey @MsUTR, I’m sorry you’re running into this issue. We’ll look into it, but this kind of problem is hard to fix. The root cause is that this VCF has a very large number of fields. Hail’s parser generates JVM bytecode specialized to the file so it can parse these kinds of VCFs very fast, but the JVM imposes a hard limit on the size of a single generated class, and for this file the generated parser exceeds it, hence the ClassTooLargeException.
This isn’t particularly high priority for us because gnomAD has publicly released Hail Tables for these VCFs. Using a Hail Table avoids the parsing problem and saves you the cost of importing. Can you use one of these tables instead?
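As a minimal sketch, reading the released table looks like this. The bucket path below is illustrative (please verify the exact location on the gnomAD downloads page before running):

import hail as hl

# Assumed path to the publicly released gnomAD 2.1.1 liftover exomes sites Hail Table;
# double-check it against the gnomAD downloads page.
GNOMAD_HT_PATH = (
    "gs://gcp-public-data--gnomad/release/2.1.1/liftover_grch38/ht/exomes/"
    "gnomad.exomes.r2.1.1.sites.liftover_grch38.ht"
)

# Reading the pre-built Hail Table skips VCF parsing entirely, so the code-generation
# limit that caused the ClassTooLargeException never comes into play.
ht = hl.read_table(GNOMAD_HT_PATH)

# Optionally write a local copy so later reads don't go back to the public bucket.
ht.write("gnomad.exomes.r2.1.1.sites.liftover_grch38.ht", overwrite=True)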