Good day!
I am experimenting with Hail, so I created a couple of files and tried to merge them by columns. However, I think I am seeing some strange behaviour:
Number of rows and cols in the original matrix table (before the merge): 76023 3
Number of rows and cols in the appended matrix table: 76023 3
Number of rows and cols in the union: 75997 6
I wonder why this might happen and how I can check what's going on. What happened to the matrices so that the union has fewer rows than the two it originates from? Thank you.
From the union_cols docs:
This method creates a :class:`.MatrixTable` which contains all columns
from both input datasets. The set of rows included in the result is
determined by the `row_join_type` parameter.
- With the default value of ``'inner'``, an inner join is performed
on rows, so that only rows whose row key exists in both input datasets
are included. In this case, the entries for each row are the
concatenation of all entries of the corresponding rows in the input
datasets.
- With `row_join_type` set to ``'outer'``, an outer join is performed on
rows, so that row keys which exist in only one input dataset are also
included. For those rows, the entry fields for the columns coming
from the other dataset will be missing.
Only distinct row keys from each dataset are included (equivalent to
calling :meth:`.distinct_by_row` on each dataset first).
This method does not deduplicate; if a column key exists identically in
two datasets, then it will be duplicated in the result.
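To make the row semantics above concrete, here is a toy model in plain Python. It is not Hail code: the function `union_cols_row_keys` and the string row keys are made up for illustration, and it only mimics the documented behaviour (distinct row keys per dataset, then an inner or outer join).

```python
def union_cols_row_keys(keys1, keys2, row_join_type="inner"):
    """Toy model of which row keys survive a column union (not Hail itself)."""
    # Only distinct row keys from each dataset take part in the join.
    distinct1, distinct2 = set(keys1), set(keys2)
    if row_join_type == "inner":
        return distinct1 & distinct2  # keys present in both datasets
    if row_join_type == "outer":
        return distinct1 | distinct2  # keys present in either dataset
    raise ValueError("row_join_type must be 'inner' or 'outer'")


# Hypothetical row keys written as "contig:position:ref:alt".
left = ["1:100:A:T", "1:100:A:T", "1:200:G:C", "1:300:C:G"]  # one duplicate
right = ["1:100:A:T", "1:200:G:C", "1:400:T:A", "1:500:T:C"]

inner = union_cols_row_keys(left, right)
outer = union_cols_row_keys(left, right, "outer")
print(len(left), len(right), len(inner), len(outer))  # prints: 4 4 2 5
```

Both inputs have 4 rows, yet the inner result has only 2, because a duplicate key is collapsed and the keys that appear on one side only are dropped.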
My guess is that either (a) there are duplicate variants in the two datasets that are being removed to make the keys distinct, or (b) there are mismatched row keys (locus/alleles likely) between the two datasets that are being removed.
You can test hypothesis (a) by running distinct_by_row().count() on each component MT and seeing if the row count decreases, and you can test hypothesis (b) by running mt1.union_cols(mt2, row_join_type='outer') and checking whether the result has more rows than the default inner join.
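A minimal sketch of how both checks could be wrapped up together. The helper name `diagnose_union_cols` and the dictionary keys are my own; it assumes `mt1` and `mt2` are Hail MatrixTables with the same row key, and it only uses the `count_rows`, `distinct_by_row`, and `union_cols` methods.

```python
def diagnose_union_cols(mt1, mt2):
    """Collect row counts that help tell hypothesis (a) from hypothesis (b).

    Assumes mt1 and mt2 are Hail MatrixTables keyed by the same row fields.
    """
    counts = {
        "mt1_rows": mt1.count_rows(),
        "mt1_distinct_rows": mt1.distinct_by_row().count_rows(),
        "mt2_rows": mt2.count_rows(),
        "mt2_distinct_rows": mt2.distinct_by_row().count_rows(),
        # Default join type is 'inner'.
        "inner_rows": mt1.union_cols(mt2).count_rows(),
        "outer_rows": mt1.union_cols(mt2, row_join_type="outer").count_rows(),
    }
    # (a) duplicate row keys: rows removed by distinct_by_row().
    counts["mt1_duplicates"] = counts["mt1_rows"] - counts["mt1_distinct_rows"]
    counts["mt2_duplicates"] = counts["mt2_rows"] - counts["mt2_distinct_rows"]
    # (b) mismatched row keys: keys found in only one of the two datasets.
    counts["unmatched_keys"] = counts["outer_rows"] - counts["inner_rows"]
    return counts


# Example usage (the paths are placeholders, not from the original post):
# import hail as hl
# mt1 = hl.read_matrix_table("first.mt")
# mt2 = hl.read_matrix_table("second.mt")
# print(diagnose_union_cols(mt1, mt2))
```

A non-zero `mt1_duplicates` or `mt2_duplicates` points to hypothesis (a), and a non-zero `unmatched_keys` points to hypothesis (b). Both can be true at once.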