Hi Hail Community,
We appear to be having an issue with using Hail to filter a larger vcf file (ie. the full gvcf file) based on a BED file.
When we take a smaller vcf file (say 600mb), we do not appear to have any issues - the vcf returned is correct as per the filtering from the BED file with the correct number of CHROM/POS and the data is as per source file.
When we use a larger gvcf file (ie. 6.5GB), we start having 2 issues:
- A number of the rows have strange information added - this is an example:
a. Original data from vcf file:
chr1 97450065 . T <NON_REF> . PASS END=97450065
GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT 0/0:38,1:39:75:39:0,75,1519:0,75,255:35,0
b. data for that CHROM/POS after performing filter of vcf file with hail on large file:
chr1 97450065 . T <NON_REF> -10.00 PASS END=97450065
GT:AD:AF:DP:F1R2:F2R1:GP:GQ:ICNT:MB:MIN_DP:PL:PRI:PS:SB:SPL:SQ
0/0:38,1:.:39:.:.:.:75:35,0:.:39:0,75,1519:.:.:.:0,75,255
- The number of rows are not correct. We have verified this a number of times with bcftools and manual checking. There is a big difference between what Hail produces (93 rows filtered) and what the expected/correct based on the BED file (271 rows should have been filtered)
What could be causing this problem?
Note that we have checked our source bgz file incase there was some king of corruption. The data is all correct prior to processing in Hail.
To help, here is the function we are using:
import hail as hl
def test():
contig_recoding = {
‘1’: ‘chr1’,
‘2’: ‘chr2’,
‘3’: ‘chr3’,
‘4’: ‘chr4’,
‘5’: ‘chr5’,
‘6’: ‘chr6’,
‘7’: ‘chr7’,
‘8’: ‘chr8’,
‘9’: ‘chr9’,
‘10’: ‘chr10’,
‘11’: ‘chr11’,
‘12’: ‘chr12’,
‘13’: ‘chr13’,
‘14’: ‘chr14’,
‘15’: ‘chr15’,
‘16’: ‘chr16’,
‘17’: ‘chr17’,
‘18’: ‘chr18’,
‘19’: ‘chr19’,
‘20’: ‘chr20’,
‘21’: ‘chr21’,
‘22’: ‘chr22’,
‘X’: ‘chrX’,
‘Y’: ‘chrY’,
‘MT’: ‘chrMT’,
}
hl.init()
vds=hl.import_vcf(‘WGS-300-1.hard-filtered.vcf.bgz’,reference_genome=‘GRCh38’)
bed = hl.import_bed(‘test.bed’,reference_genome=‘GRCh38’, contig_recoding = contig_recoding, skip_invalid_intervals=True)
filtered_variants = vds.filter_rows(hl.is_defined(bed[vds.locus]), keep=True)
hl.export_vcf(filtered_variants, ‘output/example.vcf.bgz’)
test()