Filtering variants from a hail matrix present in another hail matrix

emauryg · December 16, 2020, 2:44am

I’m trying to filter out the variants present in the controls from variants present in cases for a case control study.

right now I have the following:

mt_denovo_control = mt_denovo.filter_cols(mt_denovo.is_control == “control”)
mt_denovo_cases = mt_denovo.filter_cols(mt_denovo.is_control == “case”)

I have tried doing this:

mt_denovo_cases = mt_denovo_cases.filter_rows(hl.is_missing(mt_denovo_control.rows()[mt_denovo_cases.row_key]))

but that just removes all the variants from the cases which I don’t think is appropriate behavior.

Is there a way to just filter out the variants present in control samples from the cases hail matrix?

Thank you!

danking · December 16, 2020, 3:22am

Hey @emauryg!

I generally recommend working with a single matrix table rather than two distinct datasets. I realize that sometimes data isn’t sequenced or genotyped together though!

As to your question. You want to remove all variants in your “cases” dataset that are also present in the “control” dataset. The second code block you pasted does exactly that:

mt_denovo_cases = mt_denovo_cases.filter_rows(
    hl.is_missing(mt_denovo_control.rows()[mt_denovo_cases.row_key]))

If, after executing that, mt_denovo_cases.count() indicates no remaining rows, then the rows in your “cases” dataset are a subset of those in the “controls” dataset. I’d be curious of the output of these operations:

print(mt_denovo_cases.count())
print(mt_denovo_control.count())
# x.anti_join_rows(y) removes rows from x whose keys appear in y
print(mt_denovo_cases.anti_join_rows(mt_denovo_control.rows()).count())
print(mt_denovo_control.anti_join_rows(mt_denovo_cases.rows()).count())

If the last two statements both indicate there are zero rows, then it sounds like your datasets have the exact same set of rows. In that case, I think you’ll have an easier time working with your data if you combined these Matrix Tables into one Matrix Table:

mt_denovo_cases = mt_denovo_cases.annotate_cols(is_case = True)
mt_denovo_control = mt_denovo_control.annotate_cols(is_case = False)
mt = mt_denovo_cases.union_cols(mt_denovo_control)

If you want to only keep rows in which at least one case has at least one non-reference allele, try this:

mt = mt.filter_rows(
    hl.agg.any(
        mt.is_case & (mt.GT.n_alt_alleles() > 0)))

emauryg · December 16, 2020, 3:55am

With the command you showed I get the following

(550161, 11058)
(550161, 10518)
(0, 11058)
(0, 10518)

However, when I run the following, there are differences:

mt_denovo_cases.aggregate_entries(hl.agg.count_where(mt_denovo_cases.GT.is_non_ref()))

1793

mt_denovo_control.aggregate_entries(hl.agg.count_where(my_denovo_control.GT.is_non_ref()))
250

danking · December 16, 2020, 3:12pm

In Hail, the presence or absence of a row in a dataset is distinct from the presence or absence of genotypes with non-reference alleles in a row. However, you can make those notions equivalent by keeping only those rows that have at least one non-reference allele:

mt_denovo_cases = mt_denovo_cases.filter_rows(hl.agg.any(mt.GT.is_non_ref()), keep=True)
mt_denovo_controls = mt_denovo_controls.filter_rows(hl.agg.any(mt.GT.is_non_ref()), keep=True)

After this operation, I expect the latter two count operations will indicate a non-zero number of rows.

emauryg · December 16, 2020, 8:22pm

That seems more correct thank you!

2020-12-16 19:28:15 Hail: INFO: Ordering unsorted dataset with network shuffle

(328, 11058)

2020-12-16 19:38:40 Hail: INFO: Ordering unsorted dataset with network shuffle

(77, 10518)

2020-12-16 19:55:11 Hail: INFO: Ordering unsorted dataset with network shuffle 2020-12-16 19:58:34 Hail: INFO: Ordering unsorted dataset with network shuffle

(262, 11058)

2020-12-16 20:15:23 Hail: INFO: Ordering unsorted dataset with network shuffle 2020-12-16 20:18:49 Hail: INFO: Ordering unsorted dataset with network shuffle

(11, 10518)

danking · December 16, 2020, 8:33pm

Just a word of caution: “Ordering unsorted dataset with network shuffle” means Hail is doing possibly the slowest thing it can do to your data. Did your data not come from a file sorted by chromosome and position? You’ll experience dramatically better performance if you save your data as a Hail Table immediately after import from this unsorted source:

mt_denovo_cases.write('...')  # do slow re-ordering of variants once, re-use this file constantly
mt_denovo_cases = hl.read_matrix_table('...')
mt_denovo_cases = mt_denovo_cases.filter_rows(...)
mt_denovo_cases.count()  # much faster!

emauryg · December 17, 2020, 7:29pm

Thanks that does speed things up!

So to merge the hail matrices after the filtering above, is it ok to just use the mt.union_cols() function as you indicated above?

danking · December 17, 2020, 7:55pm

After the filtering, you cannot union_cols the matrix tables because they don’t have the same variants anymore. Note that in the output you shared the number of rows is different for cases and controls.

If you want a single MT, filtered to just those variants that have at least one non-reference allele in the cases, I recommend starting from the two original files and doing the following.

mt_denovo_cases = # import the case data
mt_denovo_cases.write('...')  # do slow re-ordering of variants once, re-use this file constantly
mt_denovo_cases = hl.read_matrix_table('...')

mt_denovo_control = # import the control data
mt_denovo_control.write('...')  # do slow re-ordering of variants once, re-use this file constantly
mt_denovo_control = hl.read_matrix_table('...')

# all data now converted to the efficient Hail MT format.

mt_denovo_cases = mt_denovo_cases.annotate_cols(is_case = True)
mt_denovo_control = mt_denovo_control.annotate_cols(is_case = False)
mt = mt_denovo_cases.union_cols(mt_denovo_control)
mt = mt.filter_rows(
    hl.agg.any(
        mt.is_case & (mt.GT.n_alt_alleles() > 0)))
print(mt.count())
mt.show()

emauryg · December 17, 2020, 7:59pm

Great thank you!

Topic		Replies	Views
Select certain samples from MatrixTable Hail Query & hailctl	9	825	October 6, 2022
Concat rows two matrix tables Hail Query & hailctl	5	442	December 8, 2023
Filter out variants on sex chromosomes from matrixtable Hail Query & hailctl	2	479	December 9, 2022
Filter variants based on other files Hail Query & hailctl	3	423	February 9, 2022
Filter_rows using row_key of a different source Hail Query & hailctl	5	870	March 8, 2019

Filtering variants from a hail matrix present in another hail matrix

Related topics