Maximal independent set behaving inconsistently

ch-kr · December 12, 2019, 6:34pm

I’m running maximal_independent_set on a ht created using pc_relate but am seeing inconsistent filtering results.

Here is the code:

def rank_related_samples(
        relatedness_ht: hl.Table,
        qc_ht: hl.Table
) -> Tuple[hl.Table, Callable[[hl.expr.Expression, hl.expr.Expression], hl.expr.NumericExpression]]:

    def annotate_related_pairs(related_pairs: hl.Table, index_col: str) -> hl.Table:
        related_pairs = related_pairs.key_by(**related_pairs[index_col])
        return related_pairs.annotate(
            **{
                index_col: related_pairs[index_col].annotate(
                    dp_mean=hl.or_else(qc_ht[related_pairs.key].sample_qc.dp_stats.mean, -1.0)
                )
            }
        ).key_by()

    relatedness_ht = annotate_related_pairs(relatedness_ht, "i")
    relatedness_ht = annotate_related_pairs(relatedness_ht, "j")

    def tie_breaker(l, r):
        return (l.dp_mean - r.dp_mean)

    return relatedness_ht, tie_breaker

and

data_source = 'broad'
freeze = 4
relatedness_ht = hl.read_table(relatedness_ht_path(data_source, freeze))
relatedness_ht = relatedness_ht.filter(relatedness_ht.kin > 0.08838835)

related_pairs_ht, related_pairs_tie_breaker = rank_related_samples(
            relatedness_ht,
            hl.read_table(qc_ht_path(data_source, freeze))
)

related_samples_to_drop_ht = hl.maximal_independent_set(related_pairs_ht.i, related_pairs_ht.j,
                                                        keep=False, tie_breaker=related_pairs_tie_breaker)
related_samples_to_drop_ht = related_samples_to_drop_ht.key_by()
related_samples_to_drop_ht = related_samples_to_drop_ht.select(**related_samples_to_drop_ht.node)
related_samples_to_drop_ht = related_samples_to_drop_ht.key_by('s')

I noticed that running this code sometimes selects a sample from a pair that has the higher mean depth but sometimes selects a sample from a pair with the lower mean depth. I checked using two separate pairs, and the samples in those lines are unique to those pairs. I also get different filtering results when I run the code on the full relatedness table, when I run the code on the table filtered to each unique pair, and when I run the code filtered to the two example pairs. Happy to share an HTML of my notebook that is clearer than my text description.

How should the tiebreaker be behaving, and why does it return different results?

danking · December 12, 2019, 8:48pm

It’s possible there is a bug here. Is there a subset of related_pairs_ht that exhibits the behavior? If you could send us a failing example table, I’m sure we can figure out what the issue is.

ch-kr · December 12, 2019, 8:54pm

Yes! I will slack it to you. Thank you!

ch-kr · December 13, 2019, 8:42pm

Sorry for posting again, but this issue is currently a blocker

danking · December 13, 2019, 10:47pm

Thanks for the bug report @ch-kr the fix is PR’ed https://github.com/hail-is/hail/pull/7729

Topic		Replies	Views
Maximal_independent_set() guide correction Hail Query & hailctl	1	521	December 11, 2018
Maximal_independent_set Hail Query & hailctl	4	508	January 21, 2021
Inconsistent output from same code Hail Query & hailctl	2	404	November 4, 2020
Inconsistent per sample QC result Hail Query & hailctl	3	397	March 15, 2022
Filter mt rows on ht error Hail Query & hailctl	10	660	April 7, 2020

Maximal independent set behaving inconsistently

Related topics