Table filter expression

mhebrard · January 16, 2020, 5:52am

After running sample_qc(), I extract the samples in a Hail Table then apply some filters. For the sake of example here some hard-coded filters:

mt = hl.sample_qc(mt)
ht_qc = mt.cols()
ht_fail = ht_qc.filter(
    (ht_qc.sample_qc.n_snp < 5200000)
    | (ht_qc.sample_qc.n_singleton < 130000)
    | (ht_qc.sample_qc.n_insertion < 550000)
    | (ht_qc.sample_qc.n_deletion < 550000)
)
nfail = ht_fail.count()
ht_fail.s.collect()

That expression allows me to display the list of samples id that failed my filters.
I also wish to create a new table with only the samples that passed my filters…

I sould do

 ht_pass = ht_qc.filter(
    (ht_qc.sample_qc.n_snp < 5200000)
    | (ht_qc.sample_qc.n_singleton < 130000)
    | (ht_qc.sample_qc.n_insertion < 550000)
    | (ht_qc.sample_qc.n_deletion < 550000)
, keep=false
)

but that require to write 2 times the same set of filters, that is error prone if I update ht_fail but do not synchronize ht_pass.

Q> Is there a smart way to declare one set of filters and gather both sets ?

soething like

myFilters = // complex boolean expression on table fields
ht_fail = ht_qc.filter(myFilters, keep=true)
ht_pass = ht_qc.filter(myFilters, keep=false)

tpoterba · January 16, 2020, 1:53pm

The semi_join and anti_join methods will be helpful here:

ht_fail = ht_qc.filter(myFilters, keep=true)
ht_pass = ht_qc.anti_join(ht_fail)

mhebrard · January 17, 2020, 12:20am

Oh, good to know, thanks.

I actually change the way I store the filters result to do similar as how VCF filters field is encoded in Hail.
I generate one field that is a set of string that contains the names of the filters that failed. Then I can test the length of the set …

ht_pass = ht_qc.filter(hl.len(ht_qc.my_filters) == 0)
ht_fail = ht_qc.filter(hl.len(ht_qc.my_filters) > 0)

Topic		Replies	Views
Filter samples from MatrixTable Hail Query & hailctl	8	669	October 22, 2021
Simple question about mt.filter_rows() Hail Query & hailctl	0	408	October 7, 2021
Filter mt rows on ht error Hail Query & hailctl	10	660	April 7, 2020
Issues with sample and variant QC by group Hail Query & hailctl	9	1162	May 14, 2020
More efficient way to extract calls? Hail Query & hailctl	2	370	December 14, 2022

Table filter expression

Related topics