I cannot import VCF to Hail

Continuing the discussion from I cannot import the UKB 200K WGS VCFs in Hail due to an empty line in the VCF after the header lines:

Hi guys could you please help me to resolve issue with VCF import to Hail I have similar issues as described abow.


here is my script
inputs=[‘/content/data_GWAS/HGDP00224.hgdp_wgs.20190516.phase10x.vcf.gz’,‘/content/data_GWAS/HGDP00228.hgdp_wgs.20190516.phase10x.vcf.gz’]

output_file = ‘/content/combined/output.mt’ # output destination

temp_bucket = ‘gs://my-temp-bucket’ # bucket for storing intermediate files

hl.experimental.run_combiner(inputs, out_file=output_file, tmp_path=temp_bucket,use_genome_default_intervals=True, reference_genome=‘GRCh38’)

The hl.experimental.run_combiner work fine. But output matrix is empty. There is sample name, genomic coordinate. But there is no alleles

How to resolve the problem?

what’s the output of:

mt = hl.read_matrix_table('/content/combined/output.mt')
print(mt.count())
print(mt._force_count_rows())

is /content a network-visible directory?

Yes i working on Google Colab VM

Rady
image

Looks like the matrix isn’t empty – what specifically is the problem?

There is no alleles column
image

Ah! Some of this code is in a transition phase, and is a bit disorganized. You can add the alleles key in the run_combiner function all using key_by_locus_and_alleles=True

I have restart script with the flag key_by_locus_and_alleles=True . And get following error

Sorry. 2 more min

I have restart script with flag key_by_locus_and_alleles=True . We have to wait. It takes approximately 1h to merge two VCF on Google Colab VM.

oof, that’s slow.

I also concerned about SNP number. In my case we have 330226136. I think it is too much for human genome?

Yes, Google VMs wery slow.Yes, Google VMs very slow. It is only 2-core CPU,14 GB of RAM and 25Gb disc space. Actually, how mach CPU (GPU) core i have to reqwest to make the merging fuster?

you also might try writing to a Google Storage output location instead of /content, I imagine the disk bandwidth might be a bottleneck here.

I still have got a problem. I have download two VCF files from 1000 g project.
/ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516/gVCFs/
The VCF are
HGDP00001.hgdp_wgs.20190516.vcf.gz HGDP00003.hgdp_wgs.20190516.vcf.gz
When i trying to merg this two fils with comand
hl.experimental.run_combiner(inputs, out_file=output_file, tmp_path=temp_bucket,use_genome_default_intervals=True, key_by_locus_and_alleles=True, reference_genome=‘GRCh38’)
I get an error

however the VCF files have to be fine. How to resolv the problem with Hail? It shuld be a way to fix the VCF files? Could you sugest a source of VCF to test hl.experimental.run_combiner? I hawe to make it functional. Otherwise Hail is absolutely useless software because it impossible to import anything in the Hail.

Hey, could you please reply to issue stated above

I’ll take a look at these GVCFs later today to see what’s going on. I suspect that there is some violation of our assumed data model in these GVCFs – in particular, that there is only one record for any chr/position – but there might be a workaround.

It appears there are alleles with numeric representations (1, 2) which is particularly concerning.

There are two rows with the same locus in that second GVCF:

chr10   132792081       .       CGGAACCGTGTGGGTGCAGCATCTACACTGGGTCCGGGAACCGTGTGGGGGTGCAGCGTCTACACTGGGTCCGGGAACCGTGTGGGGGTGCAGCGTCTACACTGGGTCCCGGAACCGTGTGGGTACAGCGTCTACACTGGGTCCGGGAACCGTGTGGGGGTGCAGCATCTACACTGGGCCCGGGAACCGTGTGGGGGTGCAGCGTCTACACTGGGCCCGGGAACCGTGGGGGTGCAGCGTCTACACTGGGTCCG     C       321326  PASS    VQSLOD=0.501;VQSRMODE=INDEL;ExcHet=7    GT:AD:DP:GQ:PGT:PID:PL     0/1:21,0,5:26:99:0|1:132792101_ATCTACACTGGGTCCGGGAACCGTGTGGGGGTGCAGCGTCTACACTGGGTCCGGGAACCGTGTGGGGGTGCAGCGTCTACACTGGGTCCCGGAACCGTGTGGGTACAGCGTCTACACTGGGTCCGGGAACCGTGTGGGGGTGCAGCATCTACACTGGGCCCGGGAACCGTGTGGGGGTGCAGCGTCTACACTGGGCCCGGGAACCGTGGGGGTGCAGCGTCTACACTGGGTCCGGGAACCGTGTGGGTGCAGCG_G:108,0,981
chr10   132792081       .       C       .       2105.86 PASS    END=132792094   GT:DP:GQ        0/0:22:61

This is a violation of our data model for GVCFs, and indicates that our sample is both 0/1 and 0/0 at the same position. Do you happen to know which variant caller was used here? That may help us figure out who to talk to about this.