Is there any way to filter variants from a GVCF file based on BED file

Hi Hail community,
I’m using Hail to filter variants from GVCF file based on a BED file. The issue is the returned results if I used VCF file was completely correct, but it wasn’t correct when I used GVCF file.
My code

  vds=hl.import_vcf('/path/gvcf/example.hard- 
  filtered.gvcf.bgz',reference_genome='GRCh38')
  
  bed = hl.import_bed('/path/test5.bed',reference_genome='GRCh38', 
  contig_recoding = contig_recoding, skip_invalid_intervals=True)
  
  filtered_variants = vds.filter_rows(hl.is_defined(bed[vds.locus]))

I checked the document and found that the locus information contains 3 information: contig(chrom), position and reference and I believe that it’s the root cause because with the GVCF file we still have END position information that defined a row data as a range (Hail interval).

I have tried to create another column called new_interval with Hail.Interval type contains chrom, pos, and END position information for the vds matrix table, but I don’t know how to filter rows data from this matrix table using BED file based on new_interval column.
Is there any way to achieve this approach?
Thanks

Hi @biodev,
If I understand correctly, I think you want to import the gvcf as a hail VDS, which understands the reference blocks/intervals. Importing a single gvcf is an unusual use case, I think the only way to do that at the moment is using the VariantDatasetCombiner to “combine” one file.

Once you have the vds, `hl.vds.filter_intervals(vds, bed) will perform the filter while correctly handling reference data.

Hi @patrick-schultz,
Sorry for any confusing,
Currently I’m using import_vcf function to import my GVCF file, build a hail VDS on this and there is no problems.
The exact problem that I’m facing is: 'Filtering GVCF data from a BED file’ using this following code:

filtered_variants = vds.filter_rows(hl.is_defined(bed[vds.locus]))

I have used this code to filter data from a VCF file, it worked perfectly. But it was wrong when I used this code for GVCF file

The root cause that I thought:
Because each line in VCF file is just a single data and it doesn’t have the END position as the one in GVCF. With GVCF we have one Interval (block) (created by chrom, position, END) per line, so the vds.locus object Locus created by chrom & position only now will be wrong.

My solution:
Create another column called new_interval with Hail.Interval data type that is built from chrom, position, and END for the GVCF vds matrix table by using annotate_rows function, then use this column to compare against intervals in the BED file. Is it possible because we have failed many times? Here is my code for this solution

mt_annotated = vds.annotate_rows(new_interval=hl.locus_interval(vds.locus.contig, vds.locus.position, vds.info.END, reference_genome='GRCh38'))
filtered_mt = hl.filter_intervals(mt_annotated, bed['interval'].collect())

Should I use VariantDatasetCombiner instead?

Thanks

Finally, I solved by creating new_interval column and then filter based on this column instead of locus column

  mt_annotated = vds.annotate_rows(new_interval=hl.locus_interval(vds.locus.contig, vds.locus.position, hl.if_else(hl.is_defined(vds.info.END), vds.info.END, vds.locus.position), reference_genome='GRCh38', includes_end=True))
  mt_filtered = mt_annotated.filter_rows(hl.any(lambda interval: mt_annotated.new_interval.overlaps(interval), bed['interval'].collect()))```