Using filter_rows parameter to generate new gvcf file without filtered rows

Hi Hail Community,

I am new to using the Hail Py libraries. I am trying to use the filter_rows parameter to generate a filter .gvcf file using a list of CHROM/POS IDs from the BED file. I am getting some errors associated to the “chr” that is normally put in front of the #CHROM in the gvcf file.

My desired output should be a .gvcf file without the rows we have filtered from the BED file (list of CHROM and POS IDs).

I am using the following function:

import hail as hl
hl.init()
vds=hl.import_vcf(‘WGS-800-20-hard-filtered.vcf.bgz’,reference_genome=‘GRCh38’)
bed = hl.import_bed(‘with-chr-prefix.bed’,reference_genome=‘GRCh38’)
filtered_variants = vds.filter_rows(hl.is_defined(bed[vds.locus]))
hl.export_vcf(filtered_variants, ‘output/example.vcf.bgz’)

I have tried amending my BED file to have the “chr” and not have the “chr”

  1. Sample of a row from the BED file with the “chr”:
    chr1 201060815 201060816
    chr1 201091993 201091994

In this case, I get the following error:
HailException: Invalid interval ‘[chr22:94761901-chr22:94761902)’ found. Start ‘chr22:94761901’ is not within the range [1-50818468] for reference genome ‘GRCh38’

  1. Sample of a row from the BED file without the “chr”:
    1 201060815 201060816
    1 201091993 201091994

In this case, I get the following error:
Invalid interval ‘[1:201060816-1:201060817)’ found. Contig ‘1’ is not in the reference genome ‘GRCh38’.

What am I doing wrong here?

Here is also a row of data from the .gvcf file produced from an illumina instrument:
chr1 896798 rs13302934 A G 139.97 PASS AC=2;AF=1.000;AN=2;DP=32;FS=0.000;MQ=101.33;QD=4.37;SOR=0.693;FractionInformativeReads=1.000;DB GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 1/1:0,32:1.000:32:0,17:0,15:93:178,96,0:1.3997e+02,9.3315e+01,2.0242e-09:0.00,34.77,37.77:0,0,16,16:0,0,13,19

Hi - as an update to the above. We managed to resolve the the error that we described above: “Invalid interval ‘[1:201060816-1:201060817)’ found. Contig ‘1’ is not in the reference genome ‘GRCh38’.”

We believe that it was resolved by setting the parameter skip_invalid_intervals = True - our understanding is that this means that we’ll skip any region that is considered as an invalid range in GRCh38 reference genome.

This has resulted in a valid formatted vcf file but only returned 2 SNP rows of data. When we run the same thing with the command line tool bcftools, we get over 40 rows.

Any ideas what we are still doing wrong?

Here is the function we are now using:

import hail as hl

def test():
contig_recoding = {
‘1’: ‘chr1’,
‘2’: ‘chr2’,
‘3’: ‘chr3’,
‘4’: ‘chr4’,
‘5’: ‘chr5’,
‘6’: ‘chr6’,
‘7’: ‘chr7’,
‘8’: ‘chr8’,
‘9’: ‘chr9’,
‘10’: ‘chr10’,
‘11’: ‘chr11’,
‘12’: ‘chr12’,
‘13’: ‘chr13’,
‘14’: ‘chr14’,
‘15’: ‘chr15’,
‘16’: ‘chr16’,
‘17’: ‘chr17’,
‘18’: ‘chr18’,
‘19’: ‘chr19’,
‘20’: ‘chr20’,
‘21’: ‘chr21’,
‘22’: ‘chr22’,
‘X’: ‘chrX’,
‘Y’: ‘chrY’,
‘MT’: ‘chrMT’,
}
hl.init()

vds=hl.import_vcf(‘WGS-300-1.hard-filtered.vcf.bgz’,reference_genome=‘GRCh38’)

bed = hl.import_bed(‘test.bed’,reference_genome=‘GRCh38’, contig_recoding = contig_recoding, skip_invalid_intervals=True)
filtered_variants = vds.filter_rows(hl.is_defined(bed[vds.locus]), keep=True)
hl.export_vcf(filtered_variants, ‘output/example.vcf.bgz’)

test()

This just ignores anything that is an invalid GRCh38 Locus, not just things that are out of bounds. You probably shouldn’t use that parameter.

You have to figure out first why this isn’t working:

contig_recoding = {
  **{str(i): 'chr' + str(i) for i in range(1, 23)},
  'X': 'chrX',
  'Y': 'chrY',
  'MT': 'chrMT',
}
bed = hl.import_bed('test.bed', reference_genome='GRCh38', contig_recoding = contig_recoding)
bed.collect()

If that is giving you an error about contigs, that’s a bug in Hail. If you share a one-line BED file that fails, we can fix that.

On the other hand, if this fails, then the issue is not the BED file but the VCF. You need to supply a contig_recoding here as well.

mt = hl.import_vcf('WGS-300-1.hard-filtered.vcf.bgz', reference_genome='GRCh38')
mt.count()

I’ll note, generally, Hail is not particularly fast at importing VCF text files and then immediately exporting VCF text files. You’ll have a much better experience if you use the Hail library for your entire analysis.

Thanks so much for the useful suggestions. Our problem was actually invalid CHROM/POS data to GRCh38. Once we removed/fixed those rows, it all worked.

1 Like