Here is my script:
inputs = ['/content/data_GWAS/HGDP00224.hgdp_wgs.20190516.phase10x.vcf.gz', '/content/data_GWAS/HGDP00228.hgdp_wgs.20190516.phase10x.vcf.gz']
Ah! Some of this code is in a transition phase, and is a bit disorganized. You can add the alleles key in the run_combiner function call using key_by_locus_and_alleles=True
Yes, Google VMs are very slow. Mine has only a 2-core CPU, 14 GB of RAM and 25 GB of disk space. Actually, how many CPU (or GPU) cores do I have to request to make the merging faster?
I still have a problem. I have downloaded two VCF files from the 1000 g project.
/ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516/gVCFs/
The VCFs are
HGDP00001.hgdp_wgs.20190516.vcf.gz HGDP00003.hgdp_wgs.20190516.vcf.gz
When I try to merge these two files with the command
hl.experimental.run_combiner(inputs, out_file=output_file, tmp_path=temp_bucket, use_genome_default_intervals=True, key_by_locus_and_alleles=True, reference_genome='GRCh38')
I get an error
However, the VCF files should be fine. How can I resolve this problem with Hail? Is there a way to fix the VCF files? Could you suggest a source of VCFs to test hl.experimental.run_combiner? I have to make it work. Otherwise Hail is absolutely useless software, because it is impossible to import anything into it.
I’ll take a look at these GVCFs later today to see what’s going on. I suspect that there is some violation of our assumed data model in these GVCFs – in particular, that there is only one record for any chr/position – but there might be a workaround.
It appears there are alleles with numeric representations (1, 2) which is particularly concerning.
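For anyone who wants to look for such alleles in their own files, here is a rough sketch in plain Python. It is not part of Hail; the function name `find_suspect_alleles` is made up for illustration, and it assumes tab-separated VCF lines with REF in column 4 and ALT in column 5. Breakend notation would also be flagged by this simple pattern.

```python
import re

# Accept plain base strings, symbolic alleles such as <NON_REF>, "*" and ".".
# Anything else (for example a bare "1" or "2") is reported as suspect.
_VALID_ALLELE = re.compile(r"^(?:[ACGTNacgtn]+|<[^<>]+>|\*|\.)$")


def find_suspect_alleles(lines):
    """Return (chrom, pos, allele) for each REF/ALT allele that looks malformed."""
    suspects = []
    for line in lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        for allele in [ref] + alt.split(","):
            if not _VALID_ALLELE.match(allele):
                suspects.append((chrom, pos, allele))
    return suspects
```

To run it on a compressed file, the lines can come from `gzip.open(path, "rt")`.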
There are two rows with the same locus in that second GVCF:
chr10 132792081 . CGGAACCGTGTGGGTGCAGCATCTACACTGGGTCCGGGAACCGTGTGGGGGTGCAGCGTCTACACTGGGTCCGGGAACCGTGTGGGGGTGCAGCGTCTACACTGGGTCCCGGAACCGTGTGGGTACAGCGTCTACACTGGGTCCGGGAACCGTGTGGGGGTGCAGCATCTACACTGGGCCCGGGAACCGTGTGGGGGTGCAGCGTCTACACTGGGCCCGGGAACCGTGGGGGTGCAGCGTCTACACTGGGTCCG C 321326 PASS VQSLOD=0.501;VQSRMODE=INDEL;ExcHet=7 GT:AD:DP:GQ:PGT:PID:PL 0/1:21,0,5:26:99:0|1:132792101_ATCTACACTGGGTCCGGGAACCGTGTGGGGGTGCAGCGTCTACACTGGGTCCGGGAACCGTGTGGGGGTGCAGCGTCTACACTGGGTCCCGGAACCGTGTGGGTACAGCGTCTACACTGGGTCCGGGAACCGTGTGGGGGTGCAGCATCTACACTGGGCCCGGGAACCGTGTGGGGGTGCAGCGTCTACACTGGGCCCGGGAACCGTGGGGGTGCAGCGTCTACACTGGGTCCGGGAACCGTGTGGGTGCAGCG_G:108,0,981
chr10 132792081 . C . 2105.86 PASS END=132792094 GT:DP:GQ 0/0:22:61
This is a violation of our data model for GVCFs, and indicates that our sample is both 0/1 and 0/0 at the same position. Do you happen to know which variant caller was used here? That may help us figure out who to talk to about this.
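If it helps to check the other GVCFs for the same problem, a minimal sketch along these lines should list every locus that appears on more than one consecutive row. It uses only the Python standard library, not Hail; the function names are hypothetical, and it assumes the file is sorted by position, so duplicate records sit next to each other.

```python
import gzip


def duplicate_loci(lines):
    """Return each (chrom, pos) that occurs on two or more adjacent records."""
    duplicates = []
    previous = None
    for line in lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split("\t", 2)
        if len(fields) < 2:
            continue  # not a VCF record
        locus = (fields[0], int(fields[1]))
        # Report a repeated locus once, however many rows share it.
        if locus == previous and (not duplicates or duplicates[-1] != locus):
            duplicates.append(locus)
        previous = locus
    return duplicates


def duplicate_loci_in_file(path):
    """Same check, reading a gzip- or bgzip-compressed VCF from disk."""
    with gzip.open(path, "rt") as handle:
        return duplicate_loci(handle)
```

Because it only remembers the previous record, it streams through a whole-genome GVCF without holding the file in memory.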