Problems with filtering a large (ie. 6.5GB) gvcf file based on BED file with Hail

Hi Hail Community,

We appear to be having an issue with using Hail to filter a larger vcf file (ie. the full gvcf file) based on a BED file.

When we take a smaller vcf file (say 600mb), we do not appear to have any issues - the vcf returned is correct as per the filtering from the BED file with the correct number of CHROM/POS and the data is as per source file.

When we use a larger gvcf file (ie. 6.5GB), we start having 2 issues:

  1. A number of the rows have strange information added - this is an example:

a. Original data from vcf file:
chr1 97450065 . T <NON_REF> . PASS END=97450065
GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:38,1:39:75:39:0,75,1519:0,75,255:35,0

b. data for that CHROM/POS after performing filter of vcf file with hail on large file:
chr1 97450065 . T <NON_REF> -10.00 PASS END=97450065
GT:AD:AF:DP:F1R2:F2R1:GP:GQ:ICNT:MB:MIN_DP:PL:PRI:PS:SB:SPL:SQ
0/0:38,1:.:39:.:.:.:75:35,0:.:39:0,75,1519:.:.:.:0,75,255

  1. The number of rows are not correct. We have verified this a number of times with bcftools and manual checking. There is a big difference between what Hail produces (93 rows filtered) and what the expected/correct based on the BED file (271 rows should have been filtered)

What could be causing this problem?

Note that we have checked our source bgz file incase there was some king of corruption. The data is all correct prior to processing in Hail.

To help, here is the function we are using:

import hail as hl

def test():
contig_recoding = {
‘1’: ‘chr1’,
‘2’: ‘chr2’,
‘3’: ‘chr3’,
‘4’: ‘chr4’,
‘5’: ‘chr5’,
‘6’: ‘chr6’,
‘7’: ‘chr7’,
‘8’: ‘chr8’,
‘9’: ‘chr9’,
‘10’: ‘chr10’,
‘11’: ‘chr11’,
‘12’: ‘chr12’,
‘13’: ‘chr13’,
‘14’: ‘chr14’,
‘15’: ‘chr15’,
‘16’: ‘chr16’,
‘17’: ‘chr17’,
‘18’: ‘chr18’,
‘19’: ‘chr19’,
‘20’: ‘chr20’,
‘21’: ‘chr21’,
‘22’: ‘chr22’,
‘X’: ‘chrX’,
‘Y’: ‘chrY’,
‘MT’: ‘chrMT’,
}
hl.init()

vds=hl.import_vcf(‘WGS-300-1.hard-filtered.vcf.bgz’,reference_genome=‘GRCh38’)

bed = hl.import_bed(‘test.bed’,reference_genome=‘GRCh38’, contig_recoding = contig_recoding, skip_invalid_intervals=True)
filtered_variants = vds.filter_rows(hl.is_defined(bed[vds.locus]), keep=True)
hl.export_vcf(filtered_variants, ‘output/example.vcf.bgz’)

test()

Hey @biointeltech , I’m sorry you’re having trouble. Let’s resolve that.

Issue 1

I can’t reproduce the field issue. I used this as /tmp/foo.vcf:

##fileformat=VCFv4.2
##INFO=<ID=END,Number=1,Type=Integer,Description="">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="MIN_DP">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype likelihoods">
##FORMAT=<ID=PL,Number=G,Type=Float,Description="PL">
##FORMAT=<ID=SPL,Number=G,Type=Float,Description="SPL">
##FORMAT=<ID=ICNT,Number=A,Type=Float,Description="ICNT">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	s1
chr1	97450065	.	T	<NON_REF>	.	PASS	END=97450065	GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT	0/0:38,1:39:75:39:0,75,1519:0,75,255:35,0

I execute this:

python3 -c 'import hail as hl; hl.export_vcf(hl.import_vcf("/tmp/foo.vcf", reference_genome='GRCh38'), "/tmp/foo-out.vcf")

And the last line is:

(base) dking@wm28c-761 /tmp % tail -n 1 /tmp/foo-out.vcf 
chr1	97450065	.	T	<NON_REF>	-10.00	PASS	END=97450065	GT:AD:GQ:DP:MIN_DP:GP:GL:PL:SPL:ICNT	0/0:38,1:75:39:39:.:.:0.00000,75.0000,1519.00:0.00000,75.0000,255.000:35.0000,0.00000

Can you provide a small example file and code that produces unexpected results?

Issue 2

Hail’s importing and filtering is extensively tested so I’m surprised there is an issue here. Unfortunately, I cannot debug further without a small example file that reproduces the unexpected behavior. For example, I need a VCF with two lines, a BED file that should filter one of them but not the other, and code that produces a VCF with the wrong lines (e.g. both, neither, or the wrong one).


What version of Hail are you using? I am using 0.2.117.

On the issue 1: We are getting a consistent result to you (we are also on the same version) - the information seems to be updated when we use the Hail filter_rows function. Specifically, -10.00 is added in “QUAL”; the “GQ:DP” values are swapped and a bunch of decimal places added to all the numbers (this last one I am not so concerned with):

The input data: chr1 97450065 . T <NON_REF> . PASS END=97450065 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:38,1:39:75:39:0,75,1519:0,75,255:35,0
Output data: chr1 97450065 . T <NON_REF> -10.00 PASS END=97450065 GT:AD:GQ:DP:MIN_DP:GP:GL:PL:SPL:ICNT 0/0:38,1:75:39:39:.:.:0.00000,75.0000,1519.00:0.00000,75.0000,255.000:35.0000,0.00000

On the issue 2:
Here are the input vcf files (with 3 lines of data), the BED file that we used to filter and the output file. The output file should contain all rows in the input file, but we only got 2 rows. The “chr1 201091971” should match the range “chr1 201091992 201091993” in the BED file.

Input rows:

chr1 11794419 rs1801131 T G,<NON_REF> 49.15 PASS DB;DP=23;FractionInformativeReads=1.00000;MQ=250.000;MQRankSum=4.00000;ReadPosRankSum=2.63600 GT:AD:AF:DP:F1R2:F2R1:GP:GQ:ICNT:MB:MIN_DP:PL:PRI:PS:SB:SPL:SQ 0|1:13,10,0:0.435000,0.00000:23:5,4,0:8,6,0:49.1480,7.46050e-05,53.0000,387.620,526.660,882.510:48:14,0:6,7,8,2:.:84,0,50,388,492,880:0.00000,34.7700,37.7700,34.7700,69.5400,37.7700:11794400:8,5,4,6:255,0,255

chr1 97450066 . G <NON_REF> -10.00 PASS END=97450076 GT:AD:AF:DP:F1R2:F2R1:GP:GQ:ICNT:MB:MIN_DP:PL:PRI:PS:SB:SPL:SQ 0/0:38,0:.:38:.:.:.:99:36,0:.:36:0,99,1504:.:.:.:0,99,255

chr1 201091971 . C <NON_REF> . PASS END=201092021 GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:32,0:32:81:29:0,81,1159:0,81,255:32,0

Output rows:

chr1 11794419 rs1801131 T G,<NON_REF> 49.15 PASS DB;DP=23;FractionInformativeReads=1.00000;MQ=250.000;MQRankSum=4.00000;ReadPosRankSum=2.63600 GT:AD:AF:DP:F1R2:F2R1:GP:GQ:ICNT:MB:MIN_DP:PL:PRI:PS:SB:SPL:SQ 0|1:13,10,0:0.435000,0.00000:23:5,4,0:8,6,0:49.1480,7.46050e-05,53.0000,387.620,526.660,882.510:48:14,0:6,7,8,2:.:84,0,50,388,492,880:0.00000,34.7700,37.7700,34.7700,69.5400,37.7700:11794400:8,5,4,6:255,0,255

chr1 97450066 . G <NON_REF> -10.00 PASS END=97450076 GT:AD:AF:DP:F1R2:F2R1:GP:GQ:ICNT:MB:MIN_DP:PL:PRI:PS:SB:SPL:SQ

BED rows:

chr1 201060814 201060815
chr1 201091992 201091993
chr1 97079070 97079071
chr1 97082390 97082391
chr1 97098597 97098598
chr1 97305363 97305364
chr1 97450057 97450058
chr1 97450058 97450059
chr1 97450064 97450066
chr1 97450067 97450068
chr1 97515786 97515787
chr1 97515838 97515839
chr1 97515864 97515865
chr1 97573862 97573863
chr1 97573942 97573943
chr1 97579892 97579893
chr1 97593342 97593343
chr1 97691775 97691776
chr1 97699473 97699474
chr1 97699505 97699506
chr1 97699534 97699535
chr1 97740409 97740414
chr1 97883328 97883329
chr1 97883351 97883352
chr1 97883352 97883353
chr1 169549810 169549811
chr1 11794418 11794419
chr1 11796320 11796321
chr1 28818675 28818676