Best way to check relatedness in large sample sets

What Hail method would you recommend for identifying related samples to prune out? I am using UKBB samples and have already restricted to unrelated individuals using the publicly available kinship information, but I want to do a second check. I restricted to 15k high-frequency, LD-pruned variants, but I still cannot find parameters that allow the hl.identity_by_descent command (the PLINK method) to complete its analysis given the large sample size. Is there a method in Hail more like the one used in KING, something that allows much quicker relationship screening?

Hi @Liz_Cirulli!

Can you describe more about your experience with hl.identity_by_descent? Could you, perhaps, share your pipeline? I would expect IBD to be quite fast. Do you know how many partitions your matrix table has before you call hl.identity_by_descent? The IBD calculation scales like n^2 * m (samples squared times variants), so larger datasets will take some time, but it should parallelize reasonably well.
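
For example, a quick way to check the partitioning and dimensions (the 'filename.mt' path below is just a placeholder) is:

import hail as hl

mt = hl.read_matrix_table('filename.mt')  # placeholder path
print(mt.n_partitions())                  # number of partitions
print(mt.count())                         # (n_variants, n_samples)
# If the table is heavily over-partitioned relative to the number of variants,
# something like mt = mt.repartition(200) is one way to reduce per-partition overhead.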

With respect to KING, are you looking for an estimator of relatedness that is robust in the presence of population structure? Hail implements the PC Relate method for relatedness inference as hl.methods.pc_relate. This method works well as long as your input dataset doesn’t have many duplicated samples or single families that dominate the dataset. The Hail docs on pc_relate give some details on the assumptions of the model. The PC Relate paper itself recommends starting from a set of unrelated individuals (a bit of a chicken-and-egg problem). I’m happy to chat a bit more about how to use PC Relate effectively on the UKBB dataset! It’s also possible that hl.methods.pc_relate’s performance is better than hl.identity_by_descent’s at this point, due to some underlying infrastructure work that PC Relate disproportionately benefits from.
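
If you decide to try PC Relate, here is a minimal sketch of a typical call (the MAF cutoff, number of PCs, kinship threshold, and output path below are placeholder values, not recommendations):

import hail as hl

mt = hl.read_matrix_table('filename.mt')           # placeholder path
rel = hl.pc_relate(mt.GT,
                   min_individual_maf=0.01,         # placeholder MAF cutoff
                   k=10,                            # PCs used to model population structure
                   min_kinship=0.1,                 # only keep pairs above this kinship
                   statistics='kin')                # kinship only; 'all' also returns IBD0/1/2
rel.write('pc_relate_results.ht', overwrite=True)   # Table keyed by sample pairs (i, j)

You can also pass precomputed principal component scores via scores_expr instead of k.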

This is about 250 billion sample-pair comparisons per variant for UKBB – that’s immense.

Quadratic scaling is terrifying.

Thanks so much for your feedback. The matrix table has 2000 partitions. I’m trying to run 13k variants x ~400k samples. I assume the duplicate samples and families have been removed since I used the UKBB’s provided relatedness info for pre-screening, so it may be that pc_relate will work fine for this dataset, and I’ll give it a try.

The code I’ve been trying to use is below. One caveat: I would rather do the PI_HAT screening within the IBD call itself rather than filtering afterward, but I couldn’t get those options to work (as shown below).

import hail as hl
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf().setAppName('relatedness check')
        .set('spark.driver.memory', '30g')
        .set('spark.dynamicAllocation.maxExecutors', '100')
        .set('spark.executor.memory', '12g')
        .set('spark.executor.cores', '2')
        .set('spark.kryoserializer.buffer.mb', '128')
        .set('spark.local.dir', 'hadoop_temp')
        .set('spark.executor.extraJavaOptions', '-Djava.io.tmpdir=hadoop_temp')
        .set('spark.driver.extraJavaOptions', '-Djava.io.tmpdir=hadoop_temp'))
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
hl.init(sc, default_reference='GRCh38', branching_factor=10)

mt = hl.read_matrix_table('filename.mt')
relatedness = hl.identity_by_descent(mt)                        # PLINK-style IBD over all sample pairs
relatives = relatedness.filter(relatedness.ibd.PI_HAT > 0.2)    # keep likely relatives
relatives.write('filename.ht', overwrite=True)

This was the command I initially wanted to run but that didn’t work:
relatedness = hl.identity_by_descent(mt, maf=mt.variant_qc.AF[1], min=0.2, max=1)
or just plain:
relatedness = hl.identity_by_descent(mt, min=0.2, max=1)
which also didn’t work.

Thanks again.

Hi @Liz_Cirulli, thanks for the additional information! Can you share the hail.log file associated with the failing runs? It is usually found in the working directory of the driver process. Could you describe more about the behavior you encountered and any error messages you saw? How many executors did you observe working (this is the number on the right-hand side of the progress bar: [Stage 0:> (N_PARTITIONS_FINISHED + N_EXECUTORS_WORKING) / N_PARTITIONS_TOTAL])? In which stage did the bad behavior begin?

Can you also describe more about what did not work with the latter two commands? They both look syntactically correct. Did you encounter an error or were they also just too slow?
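
In the meantime, one way to narrow this down is to test the maf/min/max form on a small random subset of samples and see whether it completes. This is just a sketch; the hl.variant_qc call and the 1% sampling fraction are assumptions on my part, since I don’t know how mt.variant_qc was annotated on your MatrixTable:

import hail as hl

mt = hl.read_matrix_table('filename.mt')        # placeholder path
mt = hl.variant_qc(mt)                          # (re)compute allele frequencies so mt.variant_qc.AF exists
mt_small = mt.sample_cols(0.01)                 # ~1% of samples, just to test at small scale

relatedness = hl.identity_by_descent(mt_small,
                                     maf=mt_small.variant_qc.AF[1],
                                     min=0.2, max=1.0)
relatedness.show()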