Hail 0.2 - filter intervals from a bed file

Hi,

I’m new to hail 0.2 and I have a question/suggestion for the function filter_intervals().

I have a MatrixTable loaded (ds), and I want to filter based on a set of intervals from a bed file:

bed = hl.import_bed(anno_path)
ds_filter = hl.filter_intervals(ds, bed.interval)

However, running this I get the following error:

TypeError: filter_intervals: parameter ‘intervals’: expected expression of type array<interval>, found <IntervalExpression of type interval<locus>>

So it seems like this operation require that I run bed.interval.collect(), thus returning an array (I have about 2M intervals), and then running filter_intervals? This appears sub-optimal to me - is there any other way I’m supposed to be using this function?

Hi Brian, you’re using the method as intended. Our current implementation requires that the intervals be local on master because it builds and broadcasts an interval tree data structure. We had the one, few, or perhaps 20k interval use case in mind. Are your 2 million intervals tiny and disjoint or large and overlapping? How many variants are you starting with? How long does the collect take? How long does the filter take (if you get that far)?

@cseed is an implementation that handles this use case related to the interval join discussion?

This is the better way to do it:

ds_filter = ds.filter_rows(hl.is_defined(bed[ds.locus]))

We haven’t really documented the 0.2 join stuff, that should be done with high priority now.

Just to unpack the above –

bed[ds.locus] joins the dataset with the bed file based on locus / interval<locus> matches.
This is a StructExpression which might have no fields, but it will be defined if the locus was found in any interval, and missing if the locus is outside of all intervals.

Filtering to rows where that joined struct is defined will keep the variants where locus is in an interval.

Thanks @tpoterba, I hadn’t realized the interval joins discussion had already progressed to an interval joins implementation!

Cool, thank you for the prompt reply!

Hey, I have a similar request on a table not a matrix table.

ht = hl.read_table(‘data/gnomad.genomes.v3.1.sites.ht/’)
bed = hl.import_bed(‘data/intervals.bed’, reference_genome=‘GRCh38’)

I have tried several options, but most only apply on matrix tables. Is there a comprehensive guide somewhere on how to subset the gnomad (or any other) table ?

Thanks!

Above here’s how we filter matrix tables:

The following should work for tables:

ht_filtered = ht.filter(hl.is_defined(bed[ht.locus]))

It’s also possible to use a terser syntax we’ve since added:

ht_filtered = ht.semi_join_rows(bed)

Hm, still had problems, probably due to formatting of the imported bed.
This is how I solved it (according to https://workshop.hail.is/resources ):

bed = hl.import_table(‘bed/CTR7.GRCh38.chr.bed’, impute=True)

bed = bed.transmute(
interval = hl.locus_interval(bed.chromosome,
bed.start,
bed.end,reference_genome=‘GRCh38’))

keyed_gene_table = bed.key_by(‘interval’)

ht_filtered = hl.filter_intervals(ht, keyed_gene_table[‘interval’].collect())