After running sample_qc()
, I extract the samples in a Hail Table then apply some filters. For the sake of example here some hard-coded filters:
mt = hl.sample_qc(mt)
ht_qc = mt.cols()
ht_fail = ht_qc.filter(
(ht_qc.sample_qc.n_snp < 5200000)
| (ht_qc.sample_qc.n_singleton < 130000)
| (ht_qc.sample_qc.n_insertion < 550000)
| (ht_qc.sample_qc.n_deletion < 550000)
)
nfail = ht_fail.count()
ht_fail.s.collect()
That expression allows me to display the list of samples id that failed my filters.
I also wish to create a new table with only the samples that passed my filters…
I sould do
ht_pass = ht_qc.filter(
(ht_qc.sample_qc.n_snp < 5200000)
| (ht_qc.sample_qc.n_singleton < 130000)
| (ht_qc.sample_qc.n_insertion < 550000)
| (ht_qc.sample_qc.n_deletion < 550000)
, keep=false
)
but that require to write 2 times the same set of filters, that is error prone if I update ht_fail but do not synchronize ht_pass.
Q> Is there a smart way to declare one set of filters and gather both sets ?
soething like
myFilters = // complex boolean expression on table fields
ht_fail = ht_qc.filter(myFilters, keep=true)
ht_pass = ht_qc.filter(myFilters, keep=false)