Identify duplicated sample groups

DBScan · October 9, 2024, 12:33pm

How can I group duplicated samples after running pc_relate?

I am running pc_relate with the following command:

rel = hl.pc_relate(mt.GT,
        min_individual_maf = 0.01,
        k = 20,
        statistics = "kin",
        min_kinship = (1/(2**1.5)))
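
For context, the `min_kinship` value above, 1/(2**1.5) ≈ 0.354, matches the duplicate / monozygotic-twin cutoff in the power-of-two kinship convention popularised by KING. The sketch below only illustrates that arithmetic; the helper name `kinship_lower_bound` is hypothetical and not part of Hail.

```python
# Hypothetical helper (not a Hail function): lower kinship bound for a given
# relationship degree under the power-of-two convention, where degree 0 means
# duplicate sample or monozygotic twin.
def kinship_lower_bound(degree: int) -> float:
    return 2 ** -(degree + 1.5)

labels = ["duplicate / MZ twin", "1st degree", "2nd degree", "3rd degree"]
for degree, label in enumerate(labels):
    print(f"{label}: kinship > {kinship_lower_bound(degree):.4f}")
```

With this cutoff, only pairs that look like the same individual should remain in the pc_relate output.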

The table after running pc_relate looks like this:

i.s	j.s
Sample1_Rep1	Sample1_Rep2
Sample1_Rep2	Sample1_Rep3
Sample2_Rep1	Sample2_Rep2
Sample3_Rep1	Sample3_Rep2
Sample3_Rep2	Sample3_Rep3
Sample3_Rep3	Sample3_Rep4
Sample3_Rep2	Sample3_Rep3
Sample3_Rep2	Sample3_Rep4
Sample3_Rep3	Sample3_Rep4

In this example, Sample1 was sequenced 3 times, Sample2 2 times, and Sample3 4 times.
I would like to have a table in the following format:

Sample	Group
Sample1_Rep1	1
Sample1_Rep2	1
Sample1_Rep3	1
Sample2_Rep1	2
Sample2_Rep2	2
Sample3_Rep1	3
Sample3_Rep2	3
Sample3_Rep3	3
Sample3_Rep4	3

I have tried this with pandas, but so far without success; I always end up with too many groups. For instance, Sample3_Rep1 and Sample3_Rep2 form one group, while Sample3_Rep1 and Sample3_Rep3 form another.

kasittig · October 9, 2024, 8:38pm

Would you be able to post the code snippet where you’re doing the grouping? My guess is that something wonky is going on with your key selection. I’m remembering that the Python / pandas syntax here can be a little tricky so hopefully it’s an easy fix!

DBScan · October 11, 2024, 11:49am

I gave up on pandas and used igraph for this task instead.

import hail as hl
import igraph as ig
import pandas as pd

# Identify duplicated samples
rel = hl.pc_relate(mt.GT,
        min_individual_maf = 0.01,
        k = 20,
        statistics = "kin",
        min_kinship = (1/(2**1.5)))

# Convert to a pandas DataFrame with three columns:
# i.s, j.s and kin (the kinship estimate)
rel_df = rel.to_pandas()

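# Create a graph: the first two columns (i.s, j.s) define the edges,
# and the sample IDs are used as vertex names
g = ig.Graph.DataFrame(rel_df, use_vids = False)

# Find the connected components, because we only know A->B and B->C, but not A->C
components = g.connected_components(mode = "weak")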

# Sample names
sample_names = g.vs["name"]

# Group membership
group = components.membership

# Create pandas dataframe
rel_df = pd.DataFrame(data = {"Sample": sample_names, "group": [x + 1 for x in group]})
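
The grouping step above is a connected-components problem: samples linked through any chain of pairs belong to one group. As a dependency-free alternative, here is a minimal union-find sketch, assuming the pairs have already been pulled out of the pc_relate table as (i.s, j.s) tuples. The function name `group_duplicates` and the hard-coded `pairs` list are illustrative only.

```python
def group_duplicates(pairs):
    """Map each sample to a 1-based group id using union-find."""
    parent = {}

    def find(x):
        # Register unseen samples as their own root, then walk up to the root,
        # shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two groups of every related pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the groups in order of first appearance.
    group_ids = {}
    result = {}
    for sample in list(parent):
        root = find(sample)
        group_ids.setdefault(root, len(group_ids) + 1)
        result[sample] = group_ids[root]
    return result


# Example pairs taken from the table in the question.
pairs = [
    ("Sample1_Rep1", "Sample1_Rep2"),
    ("Sample1_Rep2", "Sample1_Rep3"),
    ("Sample2_Rep1", "Sample2_Rep2"),
    ("Sample3_Rep1", "Sample3_Rep2"),
    ("Sample3_Rep2", "Sample3_Rep3"),
    ("Sample3_Rep3", "Sample3_Rep4"),
]
groups = group_duplicates(pairs)
for sample, group in groups.items():
    print(sample, group)
```

Grouping on the pair columns directly, as a plain pandas groupby would, treats each pair as its own group, which is probably why the pandas attempt produced too many groups. Merging through shared members avoids that.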
