Number of rows decreases after union_cols()

Good day!

I am experimenting with Hail, so I created a couple of files and tried to merge them by columns. However, I think I am seeing some strange behaviour:

Number of rows and cols in the original (i.e. before the merge) mtx: 76023 3
Number of rows and cols in the appended mtx: 76023 3
Number of rows and cols in the union: 75997 6

I wonder why that might happen, and how can I check what's going on? What happened to the matrices, so that the union has fewer rows than the two matrices it originates from? Thank you

From the union_cols docs:

    This method creates a :class:`.MatrixTable` which contains all columns
    from both input datasets. The set of rows included in the result is
    determined by the `row_join_type` parameter.

    - With the default value of ``'inner'``, an inner join is performed
      on rows, so that only rows whose row key exists in both input datasets
      are included. In this case, the entries for each row are the
      concatenation of all entries of the corresponding rows in the input
      datasets.
    - With `row_join_type` set to ``'outer'``, an outer join is performed on
      rows, so that row keys which exist in only one input dataset are also
      included. For those rows, the entry fields for the columns coming
      from the other dataset will be missing.

    Only distinct row keys from each dataset are included (equivalent to
    calling :meth:`.distinct_by_row` on each dataset first).

    This method does not deduplicate; if a column key exists identically in
    two datasets, then it will be duplicated in the result.
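To make the join semantics above concrete, here is a small plain-Python sketch. It is not Hail code: the row keys are invented strings standing in for (locus, alleles), and sets are used only to mimic "distinct keys, then inner or outer join".

```python
# Toy row keys standing in for (locus, alleles); all values are made up.
left = ["1:100:A:T", "1:100:A:T", "1:200:G:C", "1:300:T:G"]
right = ["1:200:G:C", "1:300:T:G", "1:400:C:A"]

# Only distinct row keys from each dataset take part in the join.
left_distinct = set(left)
right_distinct = set(right)

# 'inner': keep keys present in both; 'outer': keep keys present in either.
inner_keys = left_distinct & right_distinct
outer_keys = left_distinct | right_distinct

print(len(left), len(right))  # 4 3
print(len(inner_keys))        # 2
print(len(outer_keys))        # 4
```

In this toy case the inner result has fewer rows than either input, both because of the duplicated key in `left` and because each side has one key the other lacks.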

My guess is that either (a) there are duplicate variants in the two datasets that are being removed to make the keys distinct, or (b) there are mismatched row keys (locus/alleles likely) between the two datasets that are being removed.

You can test hypothesis (a) by running distinct_by_row().count() on each component MT and checking whether the row count is lower than that of plain count(), and you can test hypothesis (b) by using mt1.union_cols(mt2, row_join_type='outer') and checking whether the result has more rows than the inner join.
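Both checks can be bundled into one helper. This is a rough sketch, assuming `mt1` and `mt2` are Hail MatrixTables with the same row key; the function name `diagnose_union_cols` and the report keys are made up for illustration. It only uses MatrixTable methods (`count_rows`, `distinct_by_row`, `rows`, `anti_join_rows`, `union_cols`).

```python
def diagnose_union_cols(mt1, mt2):
    """Collect row counts that may explain a shrinking union_cols() result.

    Assumes mt1 and mt2 are Hail MatrixTables sharing the same row key.
    The function name and the report keys are hypothetical.
    """
    report = {}

    # Hypothesis (a): duplicate row keys inside a dataset.
    for name, mt in (("mt1", mt1), ("mt2", mt2)):
        total = mt.count_rows()
        distinct = mt.distinct_by_row().count_rows()
        report[f"{name}_rows"] = total
        report[f"{name}_duplicate_rows"] = total - distinct

    # Hypothesis (b): row keys that exist in only one dataset.
    report["mt1_only_keys"] = (
        mt1.distinct_by_row().anti_join_rows(mt2.rows()).count_rows()
    )
    report["mt2_only_keys"] = (
        mt2.distinct_by_row().anti_join_rows(mt1.rows()).count_rows()
    )

    # Compare the default inner join with the outer join.
    report["inner_rows"] = mt1.union_cols(mt2).count_rows()
    report["outer_rows"] = mt1.union_cols(mt2, row_join_type="outer").count_rows()
    return report
```

Calling `diagnose_union_cols(mt1, mt2)` on the two component MTs returns a dict: nonzero `*_duplicate_rows` points to hypothesis (a), nonzero `*_only_keys` to hypothesis (b). Each count triggers a pass over the data, so on large datasets it may be worth running the checks one at a time.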