I generally recommend working with a single matrix table rather than two distinct datasets. I realize that sometimes data isn’t sequenced or genotyped together though!
As to your question: you want to remove all variants in your “cases” dataset that are also present in the “control” dataset. The second code block you pasted does exactly that.
If, after executing that, mt_denovo_cases.count() indicates no remaining rows, then the rows in your “cases” dataset are a subset of those in the “controls” dataset. I’d be curious to see the output of these operations:
print(mt_denovo_cases.count())
print(mt_denovo_control.count())
# x.anti_join_rows(y) removes rows from x whose keys appear in y
print(mt_denovo_cases.anti_join_rows(mt_denovo_control.rows()).count())
print(mt_denovo_control.anti_join_rows(mt_denovo_cases.rows()).count())
If the last two statements both report zero rows, then your datasets have exactly the same set of rows. In that case, I think you’ll have an easier time working with your data if you combine these Matrix Tables into one Matrix Table with union_cols.
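To make the interpretation of those four counts concrete, here’s a minimal plain-Python sketch. It is not Hail code: each dataset is reduced to a set of row keys, the keys are invented for illustration, and `anti_join` and `describe` are hypothetical helpers that only mimic the semantics described above.

```python
# Toy model: each dataset is reduced to its set of row keys (locus, alleles).
# These keys are made up for illustration; they are not from the real data.
cases = {("1:1000", ("A", "T")), ("1:2000", ("G", "C")), ("2:500", ("T", "G"))}
controls = {("1:1000", ("A", "T")), ("1:2000", ("G", "C")), ("3:42", ("C", "A"))}


def anti_join(x, y):
    """Keep the keys of x that do not appear in y (models x.anti_join_rows(y.rows()))."""
    return x - y


def describe(n_only_cases, n_only_controls):
    """Translate the two anti-join counts into a statement about the row sets."""
    if n_only_cases == 0 and n_only_controls == 0:
        return "identical row sets"
    if n_only_cases == 0:
        return "cases are a subset of controls"
    if n_only_controls == 0:
        return "controls are a subset of cases"
    return "each dataset has rows the other lacks"


only_in_cases = anti_join(cases, controls)
only_in_controls = anti_join(controls, cases)

# Same four numbers as the four print statements in the Hail snippet.
print(len(cases), len(controls), len(only_in_cases), len(only_in_controls))
print(describe(len(only_in_cases), len(only_in_controls)))
```

If the first anti-join count is zero and the second is not, you are in the “subset” situation; if both are zero, the row sets match.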
In Hail, the presence or absence of a row in a dataset is distinct from the presence or absence of genotypes with non-reference alleles in that row. However, you can make those notions equivalent by keeping only those rows that have at least one non-reference allele.
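As a rough illustration of that distinction, here’s a plain-Python model (again, not Hail code). The variant names and genotype values are invented, and `has_non_ref` is a hypothetical helper; in Hail the same idea is expressed with filter_rows and an aggregation over the genotypes, as in the pipeline further down.

```python
# Toy model: each row maps a variant to the alt-allele count of every sample.
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate,
# None = missing call. All values are invented for illustration.
rows = {
    "1:1000:A:T": [0, 1, 0],
    "1:2000:G:C": [0, 0, 0],  # a row exists, but no sample carries the alt allele
    "2:500:T:G": [None, 2, 0],
}


def has_non_ref(genotypes):
    """True if at least one non-missing genotype has an alternate allele."""
    return any(n_alt is not None and n_alt > 0 for n_alt in genotypes)


# Keep only rows in which some sample is actually non-reference.
kept = {variant: gts for variant, gts in rows.items() if has_non_ref(gts)}

print(sorted(rows))  # every variant is a row before filtering
print(sorted(kept))  # after filtering, "row present" implies "variant observed"
```

After this kind of filter, comparing row sets between datasets really does compare which variants were observed.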
Just a word of caution: “Ordering unsorted dataset with network shuffle” means Hail is doing possibly the slowest thing it can do to your data. Did your data not come from a file sorted by chromosome and position? You’ll experience dramatically better performance if you save your data as a Hail Matrix Table immediately after import from this unsorted source:
mt_denovo_cases.write('...') # do slow re-ordering of variants once, re-use this file constantly
mt_denovo_cases = hl.read_matrix_table('...')
mt_denovo_cases = mt_denovo_cases.filter_rows(...)
mt_denovo_cases.count() # much faster!
After the filtering, you cannot union_cols the matrix tables because they don’t have the same variants anymore. Note that in the output you shared the number of rows is different for cases and controls.
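Here’s a small plain-Python sketch of why that happens. It is a toy model, not Hail: the `union_cols` and `anti_join_rows` functions below are hypothetical stand-ins, and the model assumes that combining columns only keeps rows whose keys appear in both inputs (an inner join on rows, which I believe is Hail’s default for union_cols).

```python
# Toy model: a matrix table is a dict from row key to per-sample entries.
# Keys and genotype values are invented for illustration.
cases = {"v1": [1, 0], "v2": [0, 1], "v3": [1, 1]}
controls = {"v1": [0, 0, 0], "v2": [0, 1, 0]}


def union_cols(left, right):
    """Concatenate samples for row keys present in BOTH inputs.

    Assumption: this mirrors an inner join on row keys.
    """
    return {key: left[key] + right[key] for key in left if key in right}


def anti_join_rows(left, right):
    """Drop the rows of left whose key also appears in right."""
    return {key: entries for key, entries in left.items() if key not in right}


combined_before = union_cols(cases, controls)  # shared variants v1 and v2 survive
filtered_cases = anti_join_rows(cases, controls)  # only v3 is left in the cases
combined_after = union_cols(filtered_cases, controls)  # no shared keys remain

print(combined_before)
print(filtered_cases)
print(combined_after)
```

Under that assumption, the anti-join leaves the cases with only the variants the controls lack, so there is nothing left for the two tables to be joined on. That is why the filtering has to happen after the tables are combined, not before.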
If you want a single MT, filtered to just those variants that have at least one non-reference allele in the cases, I recommend starting from the two original files and doing the following.
mt_denovo_cases = ...  # import the case data
mt_denovo_cases.write('...') # do slow re-ordering of variants once, re-use this file constantly
mt_denovo_cases = hl.read_matrix_table('...')
mt_denovo_control = ...  # import the control data
mt_denovo_control.write('...') # do slow re-ordering of variants once, re-use this file constantly
mt_denovo_control = hl.read_matrix_table('...')
# all data now converted to the efficient Hail MT format.
mt_denovo_cases = mt_denovo_cases.annotate_cols(is_case=True)
mt_denovo_control = mt_denovo_control.annotate_cols(is_case=False)
mt = mt_denovo_cases.union_cols(mt_denovo_control)
mt = mt.filter_rows(
    hl.agg.any(
        mt.is_case & (mt.GT.n_alt_alleles() > 0)))
print(mt.count())
mt.show()