LD pruning and IBD

hailstorm · March 28, 2021, 10:04am

Hi - I have not been successful in using LD prune. Below is the code I am using.

b_mt = mt.filter_rows(hl.len(mt.alleles) == 2)
b_mt = b_mt.repartition(100)
b_mt.write("repartitioned.mt", overwrite=True)

r_mt = hl.read_matrix_table('repartitioned.mt')
pruned_tbl = hl.ld_prune(r_mt.GT, r2 = 0.2, bp_window_size = 500000, block_size=75)

It takes a long time (hours) for a mt of ~1000 individuals having ~500,000 genotypes each.

I have tried various server configurations, but that does not seem to help. Is there an example of LD pruning? My end goal is to find relatedness (king) within samples.

Any help is appreciated.

Thanks!

danking · March 29, 2021, 3:41pm

Hi @hailstorm ,

I’m sorry you’re having trouble! I think we can fix this quickly.

Are you running this on a laptop, on a private cluster, on an Amazon EMR cluster, or on a Google Dataproc cluster? How many cores are you using?

Do you have a reason to set the block_size to 75? That severely negatively impacts performance. I suggest leaving it as the default, 4096.

Are you sure you need to run ld prune on all 500,000 variants? I suspect most relatedness methods do not need that many variants to provide good estimates. I believe many folks use only several thousand variants for relatedness.
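A minimal sketch of the two suggestions above: thin the variants first, then prune with the default block size. The function name, the input path argument, the 5% sampling fraction, and the seed are placeholders for illustration, not values from this thread. The r2 and window values are the ones from the original post.

```python
def prune_then_king(mt_path: str, keep_fraction: float = 0.05, seed: int = 0):
    """Thin variants, LD-prune with the default block size, then run KING."""
    # Imported here so the function can be defined without starting Spark.
    import hail as hl

    mt = hl.read_matrix_table(mt_path)
    # ld_prune expects biallelic variants.
    mt = mt.filter_rows(hl.len(mt.alleles) == 2)
    # Hypothetical thinning step: keep roughly keep_fraction of the variants.
    mt = mt.sample_rows(keep_fraction, seed=seed)

    # block_size is deliberately not passed, so Hail uses its default.
    pruned = hl.ld_prune(mt.GT, r2=0.2, bp_window_size=500000)
    # ld_prune returns a table of variants to keep; filter the rows to those.
    mt = mt.filter_rows(hl.is_defined(pruned[mt.row_key]))

    # Sample-by-sample matrix table with the kinship estimate in entry field `phi`.
    return hl.king(mt.GT)
```

How many variants to keep is a judgment call; the fraction here only illustrates the idea of not pruning all 500,000.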

How much memory does the computer or cluster have? If a cluster, how much memory per executor or VM do you have?

How do you start your job? Do you use ipython, python3, pyspark, spark-submit, or one of the cloud provider commands?

hailstorm · March 29, 2021, 4:00pm

Thanks @danking! Good to hear this can be fixed

This is being run on an 8 vCPU / 32GB RAM GC instance. I have also tried a 3 node cluster, and the results are the same.
I noticed block_size 75 in one of the posts and thought it enhanced performance. I will reset it to the default as you suggested.
This was a default setting in one of your posts. I will try a lower number and see if that helps.
Tried both ipython and python3.

I will run with the above settings and see if it improves the execution time.

Does using one over the other enhance performance: ipython, python3, pyspark, or spark-submit?

danking · March 29, 2021, 4:21pm

How many total cores did the 3 node cluster have? Hail unfortunately is only moderately fast on a single computer. Hail is valuable because it can scale up to hundreds or thousands of cores. Most of our users use Google Dataproc or Amazon EMR to briefly access very large clusters. In the cloud, we pay per core-hour, so 1000 cores for one hour costs the same as 10 cores for 100 hours. If you’d like to try Hail on the cloud, we have introductory material in the docs.
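The core-hour arithmetic above can be checked with a few lines. This is only an illustration; the price per core-hour is a made-up number, not a quote from any cloud provider.

```python
def core_hours(cores: int, hours: float) -> float:
    """Total core-hours consumed by a job."""
    return cores * hours


def job_cost(cores: int, hours: float, price_per_core_hour: float) -> float:
    """Cost under simple per-core-hour billing (price is hypothetical)."""
    return core_hours(cores, hours) * price_per_core_hour


# Same core-hours, so the same bill, but very different wall-clock time.
wide = job_cost(cores=1000, hours=1, price_per_core_hour=0.01)
narrow = job_cost(cores=10, hours=100, price_per_core_hour=0.01)
print(wide == narrow)
```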

Can you link to the post that sets block_size to 75? I’d like to fix that post.

The different executables do not affect performance, but they affect how you set parameters. What is the output of:

echo $PYSPARK_SUBMIT_ARGS

This variable needs to specify how much memory is available to Hail. See this post for information on how to set that variable. Try setting both the executor memory and driver memory to the total amount of RAM on your computer.
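As a rough sketch, the variable can also be set from Python before Hail or Spark starts. The helper name is made up for illustration, and the 32 comes from the 32GB instance mentioned earlier in the thread; adjust it to your machine.

```python
import os


def pyspark_submit_args(memory_gb: int) -> str:
    """Build a PYSPARK_SUBMIT_ARGS value giving driver and executor the same memory."""
    return (
        f"--driver-memory {memory_gb}g "
        f"--executor-memory {memory_gb}g pyspark-shell"
    )


# Must be set before Hail starts its Spark context to have any effect.
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args(32)
```

The trailing `pyspark-shell` is part of the value and should not be dropped.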

hailstorm · April 1, 2021, 9:07am

Thanks again Dan. Will try the options you have listed in the above thread.

This GitHub thread refers to block size and window size. Maybe a follow-up thread on experimenting with lower figures would help others in the future?

danking · April 1, 2021, 12:46pm

Thanks! I’ve edited that post with a note.

Please do share the results of the new options!

hailstorm · April 3, 2021, 7:21am

Thanks Dan! I tried all three methods IBD, King and PC-relate with the above sample set (1000 samples x 500,000 SNPs)

On a 10 node (each with 4 CPUs/16G RAM) cluster:

IBD is the fastest (few mins)
PC-relate is the slowest (few hours).
King takes ~32 mins, but I don’t think the results are correct.

The King method estimates a kinship coefficient of 0.49 for some samples that are completely unrelated (validated using another program to find shared DNA segments).
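For context on why 0.49 stands out: under the inference cutoffs commonly quoted from the KING documentation, a kinship coefficient above about 0.354 is in the duplicate or identical-twin range. A small sketch of those cutoffs follows; the function name and labels are made up for illustration.

```python
def king_relationship(phi: float) -> str:
    """Map a KING kinship coefficient to the usual relationship category.

    The cutoffs are 2**-1.5, 2**-2.5, 2**-3.5 and 2**-4.5
    (about 0.354, 0.177, 0.0884 and 0.0442).
    """
    if phi > 2 ** -1.5:
        return "duplicate or MZ twin"
    if phi > 2 ** -2.5:
        return "1st degree"
    if phi > 2 ** -3.5:
        return "2nd degree"
    if phi > 2 ** -4.5:
        return "3rd degree"
    return "unrelated or more distant"


print(king_relationship(0.49))
```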

danking · April 5, 2021, 2:16pm

That’s quite concerning! I’d really like to get to the bottom of that, if you don’t mind helping!

Can you share any other information about the dataset? What is the ancestral structure of the samples? How many of the samples have recent admixture? Is it possible for me to get access to it (or some appropriately sanitized version) for technical development purposes? What percent of genotypes are missing? What kinships do IBD and pc-relate report for those sample-pairs?

I’m not too surprised about the timings. IBD uses a very naive model (it assumes a single homogeneous population). Regarding PC-Relate, to what did you set statistics? By default, Hail computes kinship, ibd-0, ibd-1, and ibd-2. I expect statistics='kin' to take about 2x the time of King.
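A minimal sketch of restricting PC-Relate to kinship only. The wrapper name is hypothetical, and the minor allele frequency cutoff of 0.01 and 10 principal components are illustrative choices, not settings recommended in this thread.

```python
def pc_relate_kinship_only(mt, min_maf: float = 0.01, n_pcs: int = 10):
    """Run PC-Relate computing only the kinship statistic."""
    # Imported here so the function can be defined without starting Spark.
    import hail as hl

    # statistics="kin" asks for kinship only, skipping ibd-0, ibd-1 and ibd-2.
    return hl.pc_relate(mt.GT, min_maf, k=n_pcs, statistics="kin")
```

The result is a table of sample pairs with the kinship estimate in the `kin` field.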

hailstorm · April 16, 2021, 9:41am

Thanks Dan! I figured it was easier to export the MT to VCF and then run the downstream analysis. I am sure this will become a constraint as the data grows in size, but I feel this is faster with the current dataset. Additionally, I have more control over which options are fed into king or pc-relate.

shengwei66 · November 10, 2023, 4:59pm

Hello hail team,

We recently used the identity_by_descent function from hail to analyze the WGS data from UK Biobank for 20 patients in our disease cohort. In the IBD results, the values in the PI_HAT column are close to 0.5 for all pairs of individuals. This would mean that these 20 patients are all closely related, which doesn’t sound right.

Before applying the identity_by_descent function, we processed the WGS vcf files by the following steps:
• Run the import_vcf function to load the individual vcf file into the matrixtable format in hail
• Filter the variants to non-reference regions for each patient
• Annotate the variants with gnomAD allele frequencies for each patient
• Filter the common variants with gnomAD AF >= 5% for each patient
• Merge 20 patients data into one matrixtable using outer join
• Prune linkage disequilibrium on the merged matrixtable by using the ld_prune function.
Then we ran the relatedness analysis by applying the identity_by_descent function to the pruned matrixtable.

I am not sure why we are getting PI_HAT = 0.5 for all pairs of the 20 patients; maybe something is going wrong in the data processing steps above. Your help in resolving this issue would be greatly appreciated.
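As a reference point for reading PI_HAT values, here is a small sketch of the PLINK-style definition, PI_HAT = P(IBD=2) + 0.5 * P(IBD=1). It assumes Hail's identity_by_descent follows that convention for its Z0, Z1, Z2 and PI_HAT fields; the function name is made up for illustration.

```python
def pi_hat(z0: float, z1: float, z2: float) -> float:
    """Proportion of alleles shared IBD, from the three IBD-state probabilities."""
    if abs(z0 + z1 + z2 - 1.0) > 1e-9:
        raise ValueError("z0, z1 and z2 must sum to 1")
    return z2 + 0.5 * z1


# Theoretical values for a few relationships.
print(pi_hat(1.0, 0.0, 0.0))     # unrelated: 0.0
print(pi_hat(0.0, 1.0, 0.0))     # parent-child: 0.5
print(pi_hat(0.25, 0.5, 0.25))   # full siblings: 0.5
```

So a PI_HAT of 0.5 for every pair would correspond to every pair looking like first-degree relatives, which is why the result above looks suspicious.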
