How can I group duplicated samples after running pc_relate?
I am running pc_relate with the following command:
rel = hl.pc_relate(mt.GT,
                   min_individual_maf=0.01,
                   k=20,
                   statistics="kin",
                   min_kinship=1 / (2 ** 1.5))
The table after running pc_relate looks like this:
i.s | j.s | Column 3 | Column 4 |
---|---|---|---|
Sample1_Rep1 | Sample1_Rep2 | | |
Sample1_Rep2 | Sample1_Rep3 | | |
Sample2_Rep1 | Sample2_Rep2 | | |
Sample3_Rep1 | Sample3_Rep2 | | |
Sample3_Rep2 | Sample3_Rep3 | | |
Sample3_Rep3 | Sample3_Rep4 | | |
Sample3_Rep2 | Sample3_Rep3 | | |
Sample3_Rep2 | Sample3_Rep4 | | |
Sample3_Rep3 | Sample3_Rep4 | | |
In this example, Sample1 was sequenced 3 times, Sample2 2 times, and Sample3 4 times.
I would like to have a table in the following format:
Sample | Group | Column 3 | Column 4 |
---|---|---|---|
Sample1_Rep1 | 1 | | |
Sample1_Rep2 | 1 | | |
Sample1_Rep3 | 1 | | |
Sample2_Rep1 | 2 | | |
Sample2_Rep2 | 2 | | |
Sample3_Rep1 | 3 | | |
Sample3_Rep2 | 3 | | |
Sample3_Rep3 | 3 | | |
Sample3_Rep4 | 3 | | |
I have tried this with pandas, but so far without success: I always end up with too many groups. For instance, Sample3_Rep1 and Sample3_Rep2 end up in one group, while Sample3_Rep1 and Sample3_Rep3 end up in another.
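One way to think about this: each row of the pairwise table is an edge between two samples, and a group is a connected component of the resulting graph, so samples linked through a chain of pairs belong together. Treating each row as its own group would give too many groups. Below is a minimal sketch in plain Python using union-find. The function name `group_samples` is hypothetical, and the pairs are typed in by hand from the table above. In practice they would presumably come from the `i.s` and `j.s` columns of the exported table, which is an assumption about how the data is loaded.

```python
def group_samples(pairs):
    """Assign a group number to every sample that appears in `pairs`.

    Samples linked directly or through a chain of pairs share a group
    (connected components, computed with union-find).
    """
    parent = {}

    def find(x):
        # Register unseen samples as their own root.
        parent.setdefault(x, x)
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two components of every related pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the groups 1, 2, 3, ... in sorted order of sample name.
    group_ids = {}
    result = {}
    for sample in sorted(parent):
        root = find(sample)
        group_ids.setdefault(root, len(group_ids) + 1)
        result[sample] = group_ids[root]
    return result


# Unique pairs copied from the example table (illustrative data only).
pairs = [
    ("Sample1_Rep1", "Sample1_Rep2"),
    ("Sample1_Rep2", "Sample1_Rep3"),
    ("Sample2_Rep1", "Sample2_Rep2"),
    ("Sample3_Rep1", "Sample3_Rep2"),
    ("Sample3_Rep2", "Sample3_Rep3"),
    ("Sample3_Rep3", "Sample3_Rep4"),
    ("Sample3_Rep2", "Sample3_Rep4"),
]

groups = group_samples(pairs)
for sample, group in groups.items():
    print(sample, group)
```

The resulting dictionary maps each sample to its group number and can be turned into a two-column table (Sample, Group). Note that samples with no related pair above the kinship threshold never appear in the pairwise table, so they would not receive a group from this sketch.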