Table filter expression

After running sample_qc(), I extract the samples in a Hail Table then apply some filters. For the sake of example here some hard-coded filters:

mt = hl.sample_qc(mt)
ht_qc = mt.cols()
ht_fail = ht_qc.filter(
    (ht_qc.sample_qc.n_snp < 5200000)
    | (ht_qc.sample_qc.n_singleton < 130000)
    | (ht_qc.sample_qc.n_insertion < 550000)
    | (ht_qc.sample_qc.n_deletion < 550000)
)
nfail = ht_fail.count()
ht_fail.s.collect()

That expression allows me to display the list of samples id that failed my filters.
I also wish to create a new table with only the samples that passed my filters…

I sould do

 ht_pass = ht_qc.filter(
    (ht_qc.sample_qc.n_snp < 5200000)
    | (ht_qc.sample_qc.n_singleton < 130000)
    | (ht_qc.sample_qc.n_insertion < 550000)
    | (ht_qc.sample_qc.n_deletion < 550000)
, keep=false
)

but that require to write 2 times the same set of filters, that is error prone if I update ht_fail but do not synchronize ht_pass.

Q> Is there a smart way to declare one set of filters and gather both sets ?

soething like

myFilters = // complex boolean expression on table fields
ht_fail = ht_qc.filter(myFilters, keep=true)
ht_pass = ht_qc.filter(myFilters, keep=false)

The semi_join and anti_join methods will be helpful here:

ht_fail = ht_qc.filter(myFilters, keep=true)
ht_pass = ht_qc.anti_join(ht_fail)
1 Like

Oh, good to know, thanks.

I actually change the way I store the filters result to do similar as how VCF filters field is encoded in Hail.
I generate one field that is a set of string that contains the names of the filters that failed. Then I can test the length of the set …

ht_pass = ht_qc.filter(hl.len(ht_qc.my_filters) == 0)
ht_fail = ht_qc.filter(hl.len(ht_qc.my_filters) > 0)