From a matrix table, how do I remove rows (variants) with a “NA” in any one of the samples? Thank you!
Hi @CuriousGeneticist !
From a matrix table, how do I remove rows with an NA in any one of the samples?
We can take your sentence piece-by-piece from English to Hail!
From a matrix table,
This tells us that we’ll be using MatrixTable methods like annotate_rows, filter_rows, annotate_cols, etc. instead of the Table methods annotate and filter.
remove rows (variants) with
Removal of rows or columns in Hail is called “filtering”, so we want filter_rows:
mt = mt.filter_rows( some_condition )
a “NA” in any one of the samples
This has three parts, “a ‘NA’”, “any one of”, and “the samples”, which correspond to:
hl.is_missing( some_thing_that_can_be_missing )
hl.agg.any( some_other_condition )
mt.GT, mt.AD, … (the various genotype fields which vary per-sample)
I will assume you want to look for missing genotypes (mt.GT).
mt = mt.filter_rows(
    hl.agg.any(
        hl.is_missing(mt.GT)
    ),
    keep=False  # filter_rows keeps matching rows by default; keep=False removes them instead
)
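To make the keep-versus-remove semantics concrete without a Hail session, here is a minimal plain-Python sketch of the same logic. It is a toy model, not Hail code: the `rows` dictionary, the variant keys, and the `filter_rows` and `any_missing` helpers are hypothetical stand-ins, and `None` plays the role of a missing genotype. The point it illustrates is that a filter keeps the rows for which the condition is true unless told otherwise.

```python
# Toy model of a matrix table: each row (variant) maps to one genotype call
# per sample. None stands in for a missing ("NA") genotype.
rows = {
    "1:1000:A:T": ["0/0", "0/1", "1/1"],
    "1:2000:G:C": ["0/1", None, "0/0"],
    "1:3000:T:G": [None, None, None],
    "1:4000:C:A": ["1/1", "1/1", "0/1"],
}


def any_missing(genotypes):
    """Per-row analogue of hl.agg.any(hl.is_missing(mt.GT))."""
    return any(gt is None for gt in genotypes)


def filter_rows(table, predicate, keep=True):
    """Hypothetical helper: keep (or drop) the rows for which predicate is true."""
    return {key: gts for key, gts in table.items() if predicate(gts) == keep}


# keep=True retains only the rows that contain a missing call ...
with_missing = filter_rows(rows, any_missing)
# ... while keep=False removes them, which is what the question asks for.
complete = filter_rows(rows, any_missing, keep=False)

print(sorted(with_missing))
print(sorted(complete))
```

If the rows you wanted gone are the only ones left after filtering, the condition is probably inverted.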
An important caveat! Hail has both “missing” entry fields and “filtered entries”. Filtered entries are created by filter_entries. You never get filtered entries from import_vcf. You can check if you have any filtered entries by running this command:
mt = mt.compute_entry_filter_stats()
mt.entry_stats_row.n_filtered.summarize()
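Why the distinction matters, as far as I understand Hail's behavior: aggregators such as hl.agg.any skip filtered entries, while show() renders them as NA just like missing ones. The plain-Python sketch below models that under stated assumptions. It is not Hail code; the `FILTERED` sentinel and all three helper functions are hypothetical.

```python
# Toy model: None is a missing genotype; FILTERED is a hypothetical sentinel
# for an entry removed by filter_entries. Neither is a Hail object.
FILTERED = object()

row = ["0/0", FILTERED, "0/1"]


def display(entry):
    """Render both missing and filtered entries as NA, as a table display would."""
    return "NA" if entry is None or entry is FILTERED else entry


def any_missing(entries):
    """Aggregate over the row, skipping filtered entries entirely."""
    return any(e is None for e in entries if e is not FILTERED)


def n_filtered(entries):
    """Analogue of the n_filtered count from compute_entry_filter_stats."""
    return sum(1 for e in entries if e is FILTERED)


shown = [display(e) for e in row]
print(shown)             # the displayed row contains an NA
print(any_missing(row))  # the aggregation never sees the filtered entry
print(n_filtered(row))   # the filtered-entry count explains the NA
```

In this model a row can show an NA yet pass an "any missing" check untouched, which is why checking n_filtered is a useful first diagnostic.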
Thank you so much for your response Danking. I tried to run this code:
mt = mt.filter_rows(
    hl.agg.any(
        hl.is_missing(mt.GT)
    )
)
But the NA rows are still there. Does that mean that it is a “filtered entry”?
But the NA rows are still there
What is the query you are running to see the “NA rows”? This might help us best answer.
I used mt.GT.show() and I still saw the “NA” rows.
Can you copy and paste or screenshot an example row from mt.GT.show()? Also, what is printed by my suggestion above for checking for filtered entries?
Actually, I realized that it was an issue with an upstream query. Thanks so much for your help!