Hi, I would like to import the latest version of dbSNP into Hail. The official (GRCh38) VCF is available on the NCBI FTP site (https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz).
Notably, the contig names in this VCF file are RefSeq accession IDs (e.g. NC_000001.11 for chr1), and the file also includes all alternative contigs.
To import this VCF as a Hail MatrixTable, I am using hl.import_vcf with the contig_recoding and skip_invalid_loci options:
import hail as hl

# Map RefSeq accession IDs to GRCh38 contig names.
contigs_map = {
    'NC_000001.11': 'chr1',
    'NC_000002.12': 'chr2',
    'NC_000003.12': 'chr3',
    'NC_000004.12': 'chr4',
    'NC_000005.10': 'chr5',
    'NC_000006.12': 'chr6',
    'NC_000007.14': 'chr7',
    'NC_000008.11': 'chr8',
    'NC_000009.12': 'chr9',
    'NC_000010.11': 'chr10',
    'NC_000011.10': 'chr11',
    'NC_000012.12': 'chr12',
    'NC_000013.11': 'chr13',
    'NC_000014.9': 'chr14',
    'NC_000015.10': 'chr15',
    'NC_000016.10': 'chr16',
    'NC_000017.11': 'chr17',
    'NC_000018.10': 'chr18',
    'NC_000019.10': 'chr19',
    'NC_000020.11': 'chr20',
    'NC_000021.9': 'chr21',
    'NC_000022.11': 'chr22',
    'NC_000023.11': 'chrX',
    'NC_000024.10': 'chrY',
}

# Alternative contigs are not in the map, hence skip_invalid_loci=True.
mt = hl.import_vcf(
    's3://.../GCF_000001405.39.gz',
    reference_genome='GRCh38',
    force_bgz=True,
    contig_recoding=contigs_map,
    skip_invalid_loci=True,
)
mt.show()
I get a warning, but that should not be a problem, right?
Hail: WARN: expected input file '...' to end in .vcf[.bgz, .gz]
More problematically, I also get the following error:
An error was encountered:
IllegalArgumentException: requirement failed
...
Hail version: 0.2.80-4ccfae1ff293
Error summary: IllegalArgumentException: requirement failed
Any idea what might be wrong, and/or how to troubleshoot this?
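In case it helps with troubleshooting, here is a minimal sketch (plain Python, no Hail involved) of how one could list the contig names in a VCF that are not covered by the recoding map. The function name unmapped_contigs and the sample records are hypothetical and for illustration only; they are not taken from the real dbSNP file, and this check does not by itself explain the error above.

```python
from typing import Dict, Iterable, List


def unmapped_contigs(lines: Iterable[str], contigs_map: Dict[str, str]) -> List[str]:
    """Return the sorted contig names found in the CHROM column
    that are not keys of contigs_map."""
    seen = set()
    for line in lines:
        if not line.strip() or line.startswith('#'):
            continue  # skip blank lines and header/meta lines
        contig = line.split('\t', 1)[0]  # CHROM is the first tab-separated field
        if contig not in contigs_map:
            seen.add(contig)
    return sorted(seen)


# Hypothetical sample records (assumption: not real dbSNP content).
sample_lines = [
    '##fileformat=VCFv4.2',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO',
    'NC_000001.11\t10001\trs0000001\tT\tA\t.\t.\t.',
    'NT_000000.1\t100\trs0000002\tG\tC\t.\t.\t.',
]
example_map = {'NC_000001.11': 'chr1'}

print(unmapped_contigs(sample_lines, example_map))
```

On a local copy of the real file, the lines could presumably be read with gzip.open(path, 'rt') and passed to the same function, which would show which contigs skip_invalid_loci is expected to drop.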