I’m new to hail 0.2 and I have a question/suggestion for the function filter_intervals().
I have a MatrixTable loaded (ds), and I want to filter based on a set of intervals from a bed file:
bed = hl.import_bed(anno_path)
ds_filter = hl.filter_intervals(ds, bed.interval)
However, running this I get the following error:
TypeError: filter_intervals: parameter ‘intervals’: expected expression of type array<interval>, found <IntervalExpression of type interval<locus>>
So it seems like this operation require that I run bed.interval.collect(), thus returning an array (I have about 2M intervals), and then running filter_intervals? This appears sub-optimal to me - is there any other way I’m supposed to be using this function?
Hi Brian, you’re using the method as intended. Our current implementation requires that the intervals be local on master because it builds and broadcasts an interval tree data structure. We had the one, few, or perhaps 20k interval use case in mind. Are your 2 million intervals tiny and disjoint or large and overlapping? How many variants are you starting with? How long does the collect take? How long does the filter take (if you get that far)?
@cseed is an implementation that handles this use case related to the interval join discussion?
bed[ds.locus] joins the dataset with the bed file based on locus / interval<locus> matches.
This is a StructExpression which might have no fields, but it will be defined if the locus was found in any interval, and missing if the locus is outside of all intervals.
Filtering to rows where that joined struct is defined will keep the variants where locus is in an interval.
I’m working on gnomAD for a while but just tried to use hail since I now need to parse some info on v3.
I succeed to install hail in a dedicated venv on a 64core / 1TB RAM cluster.
My goal is to retrieve all variations present between intervals listed in a BED file (~3k protein-coding genes + upstream & dowstream region (5 kbp)).
Before starting, first tried to retrieve a single position (chr11). When performing filter function, response is free of delay, but when I want to show the saved table, there are numerous stages (i didn’t figure to find to what it correspond in the docs) and it seems super slow to retrieve this single interval.
import hail as hl
gs = "/data/scratch/gnomAD/v3/gnomad.genomes.v3.1.sites.ht/"
data = hl.read_table(gs)
import pandas as pd
bed = pd.DataFrame([['chr11'], ,]).T
bed.to_csv('test.bed', sep='\t', header=False, index=False)
interval_table = hl.import_locus_intervals('test.bed', reference_genome='GRCh38')
data_lite = data.filter(hl.is_defined(interval_table[data.locus]))
I started python with following parameters relative to PYSPARK : PYSPARK_SUBMIT_ARGS="--driver-memory 700g --executor-memory 10g pyspark-shell" python
So my questions are :
What are those stages ? (it seem that no computation are used for minutes before shifting from one to the next)
Am I using the good procedure to extract variations between intervals ?
Thanks in advance for your help and thanks for your impressive work !
One of the important things to know here is that the instant return of data.filter(...) is because it isn’t doing any computation – Hail is lazy and executes full pipelines when you write to disk or return data to Python (like for show()).
There are two ways to filter using intervals. One is using a table (the way you’ve written this), one is using hl.filter_intervals. While we intend to improve our optimizer to make this transformation automatically, right now for small numbers of intervals, the filter_intervals strategy is much faster. It looks like this:
intervals = [hl.parse_locus_interval('17:38449840-38530994', reference_genome='GRCh38')] # or some other way to make a list of
ds_result = hl.filter_intervals(dataset, intervals)
It’s also possible to use filter / filter_rows syntax which optimizes to the same thing:
This will make a big difference. Your version has to read the entire table from disk, the filter_intervals approach will read a single row.
show() can run several successive stages, looking at exponentially increasing pieces of the front of the dataset until it finds the desired number of rows to show. This interacts badly with pipelines that execute big filters, since those pipelines require processing a large portion of the table to find a small number of rows.