Filtering variants from a hail matrix present in another hail matrix

I’m trying to filter out the variants present in the controls from variants present in cases for a case control study.

right now I have the following:

mt_denovo_control = mt_denovo.filter_cols(mt_denovo.is_control == “control”)
mt_denovo_cases = mt_denovo.filter_cols(mt_denovo.is_control == “case”)

I have tried doing this:

mt_denovo_cases = mt_denovo_cases.filter_rows(hl.is_missing(mt_denovo_control.rows()[mt_denovo_cases.row_key]))

but that just removes all the variants from the cases which I don’t think is appropriate behavior.

Is there a way to just filter out the variants present in control samples from the cases hail matrix?

Thank you!

Hey @emauryg!

I generally recommend working with a single matrix table rather than two distinct datasets. I realize that sometimes data isn’t sequenced or genotyped together though!

As to your question. You want to remove all variants in your “cases” dataset that are also present in the “control” dataset. The second code block you pasted does exactly that:

mt_denovo_cases = mt_denovo_cases.filter_rows(
    hl.is_missing(mt_denovo_control.rows()[mt_denovo_cases.row_key]))

If, after executing that, mt_denovo_cases.count() indicates no remaining rows, then the rows in your “cases” dataset are a subset of those in the “controls” dataset. I’d be curious of the output of these operations:

print(mt_denovo_cases.count())
print(mt_denovo_control.count())
# x.anti_join_rows(y) removes rows from x whose keys appear in y
print(mt_denovo_cases.anti_join_rows(mt_denovo_control.rows()).count())
print(mt_denovo_control.anti_join_rows(mt_denovo_cases.rows()).count())

If the last two statements both indicate there are zero rows, then it sounds like your datasets have the exact same set of rows. In that case, I think you’ll have an easier time working with your data if you combined these Matrix Tables into one Matrix Table:

mt_denovo_cases = mt_denovo_cases.annotate_cols(is_case = True)
mt_denovo_control = mt_denovo_control.annotate_cols(is_case = False)
mt = mt_denovo_cases.union_cols(mt_denovo_control)

If you want to only keep rows in which at least one case has at least one non-reference allele, try this:

mt = mt.filter_rows(
    hl.agg.any(
        mt.is_case & (mt.GT.n_alt_alleles() > 0)))

With the command you showed I get the following

(550161, 11058)
(550161, 10518)
(0, 11058)
(0, 10518)

However, when I run the following, there are differences:

mt_denovo_cases.aggregate_entries(hl.agg.count_where(mt_denovo_cases.GT.is_non_ref()))

1793

mt_denovo_control.aggregate_entries(hl.agg.count_where(my_denovo_control.GT.is_non_ref()))
250

In Hail, the presence or absence of a row in a dataset is distinct from the presence or absence of genotypes with non-reference alleles in a row. However, you can make those notions equivalent by keeping only those rows that have at least one non-reference allele:

mt_denovo_cases = mt_denovo_cases.filter_rows(hl.agg.any(mt.GT.is_non_ref()), keep=True)
mt_denovo_controls = mt_denovo_controls.filter_rows(hl.agg.any(mt.GT.is_non_ref()), keep=True)

After this operation, I expect the latter two count operations will indicate a non-zero number of rows.

That seems more correct thank you!

2020-12-16 19:28:15 Hail: INFO: Ordering unsorted dataset with network shuffle

(328, 11058)

2020-12-16 19:38:40 Hail: INFO: Ordering unsorted dataset with network shuffle

(77, 10518)

2020-12-16 19:55:11 Hail: INFO: Ordering unsorted dataset with network shuffle 2020-12-16 19:58:34 Hail: INFO: Ordering unsorted dataset with network shuffle

(262, 11058)

2020-12-16 20:15:23 Hail: INFO: Ordering unsorted dataset with network shuffle 2020-12-16 20:18:49 Hail: INFO: Ordering unsorted dataset with network shuffle

(11, 10518)

Just a word of caution: “Ordering unsorted dataset with network shuffle” means Hail is doing possibly the slowest thing it can do to your data. Did your data not come from a file sorted by chromosome and position? You’ll experience dramatically better performance if you save your data as a Hail Table immediately after import from this unsorted source:

mt_denovo_cases.write('...')  # do slow re-ordering of variants once, re-use this file constantly
mt_denovo_cases = hl.read_matrix_table('...')
mt_denovo_cases = mt_denovo_cases.filter_rows(...)
mt_denovo_cases.count()  # much faster!

Thanks that does speed things up!

So to merge the hail matrices after the filtering above, is it ok to just use the mt.union_cols() function as you indicated above?

After the filtering, you cannot union_cols the matrix tables because they don’t have the same variants anymore. Note that in the output you shared the number of rows is different for cases and controls.

If you want a single MT, filtered to just those variants that have at least one non-reference allele in the cases, I recommend starting from the two original files and doing the following.

mt_denovo_cases = # import the case data
mt_denovo_cases.write('...')  # do slow re-ordering of variants once, re-use this file constantly
mt_denovo_cases = hl.read_matrix_table('...')

mt_denovo_control = # import the control data
mt_denovo_control.write('...')  # do slow re-ordering of variants once, re-use this file constantly
mt_denovo_control = hl.read_matrix_table('...')

# all data now converted to the efficient Hail MT format.

mt_denovo_cases = mt_denovo_cases.annotate_cols(is_case = True)
mt_denovo_control = mt_denovo_control.annotate_cols(is_case = False)
mt = mt_denovo_cases.union_cols(mt_denovo_control)
mt = mt.filter_rows(
    hl.agg.any(
        mt.is_case & (mt.GT.n_alt_alleles() > 0)))
print(mt.count())
mt.show()

Great thank you!