LD pruning and IBD

Hi - I have not been successful in using LD prune. Below is the code I am using.

b_mt = mt.filter_rows(hl.len(mt.alleles) == 2)
b_mt = b_mt.repartition(100)  # repartition returns a new table, so keep the result
b_mt.write("repartitioned.mt", overwrite=True)

r_mt = hl.read_matrix_table('repartitioned.mt')
pruned_tbl = hl.ld_prune(r_mt.GT, r2 = 0.2, bp_window_size = 500000, block_size=75)

It takes a long time (hours) for a mt of ~1000 individuals having ~500,000 genotypes each.

I have tried various server configs, but that does not seem to help. Is there an example of LD pruning? My end goal is to find relatedness (King) within samples.

Any help is appreciated.

Thanks!

Hi @hailstorm ,

I’m sorry you’re having trouble! I think we can fix this quickly.

Are you running this on a laptop, on a private cluster, on an Amazon EMR cluster, or on a Google Dataproc cluster? How many cores are you using?

Do you have a reason to set the block_size to 75? That severely negatively impacts performance. I suggest leaving it as the default, 4096.

Are you sure you need to run ld prune on all 500,000 variants? I suspect most relatedness methods do not need that many variants to provide good estimates. I believe many folks use only several thousand variants for relatedness.
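To make the two suggestions above concrete, here is a minimal sketch that thins the variants before pruning and leaves block_size unset so Hail picks its default. The helper names, the target of 10,000 variants, and the seed are assumptions for illustration, not values from this thread; r2 and bp_window_size are copied from the original post.

```python
def sampling_fraction(n_variants: int, target: int) -> float:
    """Fraction of rows to keep so that roughly `target` variants remain."""
    if n_variants <= 0 or target <= 0:
        raise ValueError("n_variants and target must be positive")
    return min(1.0, target / n_variants)


def prune_subset(mt, target_variants=10_000, r2=0.2, bp_window_size=500_000, seed=0):
    """Hypothetical helper: downsample biallelic variants, then LD prune them."""
    import hail as hl  # imported here because Hail needs a working Spark/Java setup

    biallelic = mt.filter_rows(hl.len(mt.alleles) == 2)
    fraction = sampling_fraction(biallelic.count_rows(), target_variants)
    subset = biallelic.sample_rows(fraction, seed=seed)

    # block_size is deliberately not passed, so Hail uses its default.
    pruned = hl.ld_prune(subset.GT, r2=r2, bp_window_size=bp_window_size)

    # ld_prune returns a table of variants to keep; filter the matrix table to them.
    return subset.filter_rows(hl.is_defined(pruned[subset.row_key]))


# With ~500,000 variants, a target of 10,000 keeps about 2% of rows.
print(sampling_fraction(500_000, 10_000))
```

The pruned matrix table returned by prune_subset could then be passed to a relatedness method such as hl.king(pruned_mt.GT).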

How much memory does the computer or cluster have? If a cluster, how much memory per executor or VM do you have?

How do you start your job? Do you use ipython, python3, pyspark, spark-submit, or one of the cloud provider commands?

Thanks @danking! Good to hear this can be fixed :)

  1. This is being run on an 8 vCPU / 32GB RAM GC instance. But I have tried on a 3 node cluster and the results are the same.
  2. I noticed block_size 75 in one of the posts and thought it enhanced performance. I will reset it to the default as you suggested.
  3. This was a default setting in one of your posts. I will try a lower number and see if that helps.
  4. Tried both ipython and python3.

I will run with the above settings and see if it improves the execution time.

Does using one over the others enhance performance: ipython, python3, pyspark, or spark-submit?

How many total cores did the 3 node cluster have? Hail unfortunately is only moderately fast on a single computer. Hail is valuable because it can scale up to hundreds or thousands of cores. Most of our users use Google Dataproc or Amazon EMR to briefly access very large clusters. In the cloud, we pay per core-hour, so 1000 cores for one hour costs the same as 10 cores for 100 hours. If you’d like to try Hail on the cloud, we have introductory material in the docs.
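The core-hour arithmetic above can be checked with a few lines. The price per core-hour below is a made-up placeholder, not a real cloud price; only the ratio matters.

```python
def core_hours(cores: int, hours: float) -> float:
    """Total core-hours consumed by a job."""
    return cores * hours


def job_cost(cores: int, hours: float, price_per_core_hour: float) -> float:
    """Cost of a job when billing is purely per core-hour."""
    return core_hours(cores, hours) * price_per_core_hour


PRICE = 0.05  # hypothetical price per core-hour, for illustration only

big_and_short = job_cost(1000, 1, PRICE)    # 1000 cores for 1 hour
small_and_long = job_cost(10, 100, PRICE)   # 10 cores for 100 hours

# Both jobs use 1000 core-hours, so they cost the same.
print(big_and_short == small_and_long)
```

This assumes the job scales well across cores, which is the case Hail is designed for.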

Can you link to the post that sets block_size to 75? I’d like to fix that post.

The different executables do not affect performance, but they affect how you set parameters. What is the output of:

echo $PYSPARK_SUBMIT_ARGS

This variable needs to specify how much memory is available to Hail. See this post for information on how to set that variable. Try setting both the executor memory and driver memory to the total amount of RAM on your computer.
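As a rough sketch of what that might look like from Python, the snippet below builds the variable and sets it before Hail starts. The helper name is hypothetical, and the 32 GB figure is taken from the single instance described earlier in this thread; adjust it to your own machine. The trailing pyspark-shell token is needed when Spark is launched from a Python process.

```python
import os


def pyspark_submit_args(memory_gb: int) -> str:
    """Build a PYSPARK_SUBMIT_ARGS value giving driver and executor the same memory."""
    mem = f"{memory_gb}g"
    return f"--driver-memory {mem} --executor-memory {mem} pyspark-shell"


# This must be set before Hail (and therefore Spark) starts in this process.
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args(32)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

Setting the variable after Hail has already initialized has no effect on the running session, so it is safest to do this at the very top of the script or in the shell before launching Python.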

Thanks again Dan. Will try the options you have listed in the above thread.

This GitHub thread refers to block size and window size. Maybe a follow-up thread experimenting with lower figures would help others in the future?

Thanks! I’ve edited that post with a note.

Please do share the results of the new options!

Thanks Dan! I tried all three methods (IBD, King, and PC-relate) with the above sample set (1000 samples x 500,000 SNPs).

On a 10 node (each with 4 CPUs/16G RAM) cluster:

  • IBD is the fastest (few mins)
  • PC-relate is the slowest (few hours).
  • King takes ~32 mins, but I don’t think the results are correct.

The King method estimates a kinship coefficient of 0.49 for some samples that are completely unrelated (validated using another program to find shared DNA segments).

That’s quite concerning! I’d really like to get to the bottom of that, if you don’t mind helping!
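For context on why 0.49 stands out: a kinship coefficient near 0.5 is what one would expect for duplicate samples or monozygotic twins, not for unrelated pairs. The sketch below maps a kinship estimate to a relationship class using the inference cutoffs commonly quoted from the KING documentation (0.354, 0.177, 0.0884, 0.0442). The function name and labels are mine, and the cutoffs are rules of thumb rather than anything Hail enforces.

```python
def king_relationship(phi: float) -> str:
    """Classify a kinship estimate with the commonly quoted KING cutoffs."""
    if phi > 0.354:
        return "duplicate or monozygotic twin"
    if phi >= 0.177:
        return "first-degree"
    if phi >= 0.0884:
        return "second-degree"
    if phi >= 0.0442:
        return "third-degree"
    return "unrelated or more distant"


# The value reported above falls in the top class.
print(king_relationship(0.49))
```

Under these cutoffs, a truly unrelated pair should land well below 0.0442, so an estimate of 0.49 points to something wrong in the data or the computation.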

Can you share any other information about the dataset? What is the ancestral structure of the samples? How many of the samples have recent admixture? Is it possible for me to get access to it (or some appropriately sanitized version) for technical development purposes? What percent of genotypes are missing? What kinships do IBD and pc-relate report for those sample-pairs?

I’m not too surprised about the timings. IBD uses a very naive model (it assumes a single homogeneous population). Regarding PC-Relate, what did you set statistics to? By default, Hail computes kinship, ibd-0, ibd-1, and ibd-2. I expect statistics='kin' to take about 2x the time of King.
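A minimal sketch of a kinship-only PC-Relate call is below. The wrapper name and the values for min_individual_maf and k are placeholders I picked for illustration, not recommendations; only statistics='kin' comes from the discussion above.

```python
def kinship_only(mt, min_individual_maf=0.01, k=10):
    """Hypothetical wrapper: run PC-Relate computing only the kinship statistic."""
    import hail as hl  # imported here because Hail needs a working Spark/Java setup

    # statistics='kin' skips the ibd-0, ibd-1, and ibd-2 estimates that the
    # default setting also computes.
    return hl.pc_relate(mt.GT, min_individual_maf, k=k, statistics='kin')
```

The result is a table of sample pairs with a kin field, which can be compared against the King estimates for the same pairs.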