I am trying to get the maximum independent set of phenotypes based on phenotype correlations I have generated from the UKB. I have already made the pairs file as described in the documentation (Hail | Miscellaneous) and now am wanting to prune them to the max indep sets. I am trying several cor thresholds to see how many phenotypes would remain at different ones, but strangely am getting the same number in the max set at the end despite differing lengths of pairs lists to prune. I am directly following the example in the docs so want to make sure something isn’t going wrong. Advice would be appreciated!
Here’s the relevant code:
ht = hl.read_table(‘Pheno_pairwise_correlations.ht’)
pairs10 = ht.filter(hl.float32(ht.cor) > 0.1) #96264 long for cor at 10% limit
pairs20 = ht.filter(hl.float32(ht.cor) > 0.2) #283680 long for cor at 20% limit
related_to_remove10 = hl.maximal_independent_set(pairs10.i, pairs10.j, False) #14165 long
related_to_remove20 = hl.maximal_independent_set(pairs20.i, pairs20.j, False) #14165 long also
Thanks so much,
Hey @eatkinson !
It looks like you’re asking for the nodes that should be removed, i.e. the complement of the set of nodes to keep. Try changing the third argument of MIS to
True. I like to use the keyword argument syntax so this sticks out a bit:
Thanks for getting back to me. Yes, I am indeed trying to get a list of the ones to be removed that I will then exclude from the full dataset. This is how the process was presented in the docs, which have the following as the example:
Starting from the above pairs, prune individuals from a dataset until no close relationships remain:
related_samples_to_remove = hl.maximal_independent_set(pairs.i, pairs.j, False)
result = dataset.filter_cols(
… hl.is_defined(related_samples_to_remove[dataset.col_key]), keep=False)
As an aside, if I do change the 3rd argument to True, I end up with a list 0 long.
Ah, OK, I understand. The first thing I would check is: are there any self-correlations?
ht.filter(ht.i == ht.j).show()
If a node has a “self-edge” (an edge to itself) then it cannot be in the independent set (it’s always connected to another node, namely itself). You can remove these with:
ht = ht.filter(ht.i != ht.j)
oh makes sense! I just tried removing self-correlations (which indeed were there) and it results in different lengths now. Thanks so much!