After the union_cols() number of rows decreases

annalisasnow · January 19, 2022, 2:33pm

Good day!

I am experimenting with Hail, so I created a couple of files and tried to merge them by columns. However, I think I see some strange behaviour:

Number of rows and cols in the original (i.e. before the merge) mtx: 76023 3
Number of rows and cols in the appended mtx: 76023 3
Union's number of rows and cols: 75997 6

I wonder why that might happen and how I can check what's going on. What happened to the matrices, so that the union has fewer rows than the two it originates from? Thank you.

tpoterba · January 19, 2022, 3:06pm

From the union_cols docs:

    This method creates a :class:`.MatrixTable` which contains all columns
    from both input datasets. The set of rows included in the result is
    determined by the `row_join_type` parameter.

    - With the default value of ``'inner'``, an inner join is performed
      on rows, so that only rows whose row key exists in both input datasets
      are included. In this case, the entries for each row are the
      concatenation of all entries of the corresponding rows in the input
      datasets.
    - With `row_join_type` set to ``'outer'``, an outer join is performed on
      rows, so that row keys which exist in only one input dataset are also
      included. For those rows, the entry fields for the columns coming
      from the other dataset will be missing.

    Only distinct row keys from each dataset are included (equivalent to
    calling :meth:`.distinct_by_row` on each dataset first).

    This method does not deduplicate; if a column key exists identically in
    two datasets, then it will be duplicated in the result.
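To make the quoted row semantics concrete, here is a small pure-Python model of which row keys survive. It is only a sketch of the documented behaviour, not Hail code: `union_cols_row_keys` and the string keys are made up for illustration.

```python
def union_cols_row_keys(keys1, keys2, row_join_type="inner"):
    """Model which row keys survive union_cols (illustrative helper, not a Hail API)."""
    # Each side is deduplicated first, mirroring the implicit distinct_by_row().
    distinct1, distinct2 = set(keys1), set(keys2)
    if row_join_type == "inner":
        return distinct1 & distinct2  # keys present in both datasets
    if row_join_type == "outer":
        return distinct1 | distinct2  # keys present in either dataset
    raise ValueError(f"unsupported row_join_type: {row_join_type!r}")


# Hypothetical keys: keys1 has one duplicate, and each side has keys the other lacks.
keys1 = ["1:100:A:T", "1:100:A:T", "1:200:G:C", "1:300:C:G"]
keys2 = ["1:100:A:T", "1:200:G:C", "1:400:T:A", "1:500:T:A"]

inner = union_cols_row_keys(keys1, keys2)
outer = union_cols_row_keys(keys1, keys2, row_join_type="outer")
print(len(keys1), len(keys2), len(inner), len(outer))  # 4 4 2 5
```

Both inputs have 4 rows, yet the inner result has only 2: one row is lost to deduplication and the rest to keys that do not match on both sides.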

My guess is that either (a) there are duplicate variants in the two datasets that are being removed to make the keys distinct, or (b) there are mismatched row keys (locus/alleles likely) between the two datasets that are being removed.

You can test hypothesis (a) by running distinct_by_row().count() on each component MT and seeing if the row count decreases, and you can test hypothesis (b) by using mt1.union_cols(mt2, row_join_type='outer') and seeing if the row count is higher than with the default inner join.
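Both checks can be combined into one helper, sketched below. The name `diagnose_union_cols` and the report layout are made up for illustration. It assumes `mt1` and `mt2` are Hail MatrixTables with the same row key, and it only calls the MatrixTable methods `count_rows()`, `distinct_by_row()` and `union_cols()`, so the block itself needs no import.

```python
def diagnose_union_cols(mt1, mt2):
    """Collect row counts that may explain a shrinking union_cols() result.

    mt1 and mt2 are assumed to be hail.MatrixTable objects sharing a row key.
    """
    report = {}
    for name, mt in (("mt1", mt1), ("mt2", mt2)):
        n_rows = mt.count_rows()
        n_distinct = mt.distinct_by_row().count_rows()
        # Hypothesis (a): a positive duplicate_rows means repeated row keys.
        report[name] = {
            "rows": n_rows,
            "distinct_rows": n_distinct,
            "duplicate_rows": n_rows - n_distinct,
        }

    inner_rows = mt1.union_cols(mt2).count_rows()
    outer_rows = mt1.union_cols(mt2, row_join_type="outer").count_rows()
    report["inner_rows"] = inner_rows
    report["outer_rows"] = outer_rows
    # Hypothesis (b): keys found in only one of the two datasets.
    report["unmatched_keys"] = outer_rows - inner_rows
    return report
```

Calling `diagnose_union_cols(mt1, mt2)` on the two tables being merged returns a plain dict. A non-zero `duplicate_rows` supports hypothesis (a), and a non-zero `unmatched_keys` supports hypothesis (b).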
