Concat rows two matrix tables

Hi hail community!
I would like to concat two matrix tables, for simplicity, let’s call them mt1 and mt2.
mt1 has the following structure:

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'varid': str
----------------------------------------
Entry fields:
    'GT': call
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

While mt2:

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'fam_id': str
    'pat_id': str
    'mat_id': str
    'is_female': bool
    'is_case': bool
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'cm_position': float64
----------------------------------------
Entry fields:
    'GT': call
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

What I want is:

1. Discard all the samples (s) from mt2 NOT present in mt1.
2. Append the variants NOT present in mt2 into mt1

For 1., I am using this:

sample_names = mt_imp.s.collect()
mt2_filt = mt2.filter_cols(hl.literal(sample_names).contains(mt2.s))

I am not sure how to achieve the 2. Maybe some variation of full_outer_join_mt function but for the rows?

Thank you in advance,

Your #2 is not clear. How are you determining which variants are not present in mt2? If it’s based on mt1, then they already exist.

To select/exclude variants present in one matrix table you can either use semi_join_rows or anti_join_rows depending on your desired results.

Example to select rows from mt1 that are present in mt2.
results = mt1.semi_join_rows(mt2.rows())

Hi @anh151 !

Thank you for your answer and sorry if my #2 question was not clear enough.

Basically what I am trying to do is to join both mt1 and mt2 matrix tables into a new matrix (lets call it mt3). mt2 has some samples in common with mt1 and some variants in common with mt1. What I want to do is to append the content of mt2 into mt1. Since the mt1 and mt2 have different samples, first (#1) I want to keep those samples of mt2 that are present in mt1 (so that mt1 and mt2 have the same samples). And #2, since some variants in mt2 are already present in mt1, in order to avoid having duplicates in the merged final mt3, I want to only append to mt3 those mt2 variants unique to mt2 (not present in mt1).

One challenge that I see with your data is that the column and row fields aren’t exactly the same between the two mt and I’m unsure if hail will be able to handle that. However, below are some options for some of the steps you’re considering. You might have to do some work to get the schemas the same before combining.

If you would like to combine the variants of the two matrix tables you could do the following:

mt3 = mt1.union_rows(mt2)
mt3 = mt3.distinct_by_row()

If you would like to combine the columns of the two matrix tables you could do something similar:

mt3 = mt1.union_cols(mt2)
mt3 = mt3.distinct_by_col()

If you want to only select the cols from mt1 that are in mt2
mt3 = mt1.semi_join_cols(mt2.cols())

Hello again @anh151,
I followed your instructions and now I have some code that does what I want. I am sharing it here:

    # Uniform the schemas
    mt2 = mt2.drop('fam_id', 'pat_id', 'mat_id', 'is_female', 'is_case', 'cm_position')
    # Add varid row field
    mt2 = mt2.annotate_rows(varid = mt2.locus.contig + ':' + hl.str(mt2.locus.position) + '_' + mt2.alleles[0] + '_' + mt2.alleles[1])
    # Perform an anti-join to get rows in mt2 that are not in mt1
    rows_unique_to_mt2 = mt2.anti_join_rows(mt1.rows())
    # Keep only common samples in imputation and called datasets
    # Removes any columns in mt2 that are NOT in mt1.
    rows_unique_to_mt2 = rows_unique_to_mt2.semi_join_cols(mt1.cols())
    # Join mt1 and rows_unique_to_mt2
    # Only append into mt1 those rows_unique_to_mt2 variants not present already in mt1
    mt3 = mt1.union_rows(rows_unique_to_mt2)

All the steps provided above run very fast, except the last one, union_rows. This is extremely slow, probably due to the size of the mts (1M rows and 500K cols).
Is there anyway to speed it up? Also, before executing union_rows command, is there a way to check if rows_unique_to_mt2 has actually any content? And if so, run union_rows, but if not, skip it.

Thank you again,

Hi @irun
It may seem like all the lines except the last one run fast but they are not actually running. Hail uses lazy evaluation and all those tasks/jobs are being queued up and are only executed once the data is needed. You can read more about this in the expressions overview:
https://hail.is/docs/0.2/overview/expressions.html

I don’t think it will be beneficial to check if rows_unique_to_mt2 has any data. Hail would first have to excute all the previous code to check if the condition is met and then likely have to re-execute the code once you get to the final step of your analysis. So it likely ends up in doing double the work.

I’ll also add that hail is designed well in that the analysis steps are combined in order to improve efficiency and to reduce the amount of data that you are working with. In most cases it’s probably best to let hails backend framework handle optimizing steps.

Hope this helps.