I’m sorry you’re having trouble! I think we can fix this quickly.
Are you running this on a laptop, on a private cluster, on an Amazon EMR cluster, or on a Google Dataproc cluster? How many cores are you using?
Do you have a reason to set the block_size to 75? A block size that small severely hurts performance. I suggest leaving it at the default, 4096.
Are you sure you need to run LD pruning on all 500,000 variants? I suspect most relatedness methods do not need that many variants to provide good estimates. I believe many folks use only several thousand variants for relatedness.
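To make the downsampling idea concrete, here is a minimal sketch of how one might thin 500,000 variants to a few thousand before estimating relatedness. The helper names and the target of 5,000 are hypothetical choices for illustration; the only Hail calls assumed are `MatrixTable.count_rows` and `MatrixTable.sample_rows`, and `mt` is assumed to be an already-loaded MatrixTable.

```python
def sampling_fraction(n_variants: int, target: int) -> float:
    """Fraction of rows to keep so that roughly `target` variants remain."""
    if n_variants <= 0 or target <= 0:
        raise ValueError("n_variants and target must be positive")
    # Never ask for more than 100% of the rows.
    return min(1.0, target / n_variants)


def downsample_variants(mt, target: int = 5_000, seed: int = 0):
    """Randomly keep about `target` variants of a Hail MatrixTable.

    `mt` is assumed to be a Hail MatrixTable; `target` and `seed` are
    hypothetical defaults, not recommendations from the Hail team.
    """
    p = sampling_fraction(mt.count_rows(), target)
    # sample_rows keeps each row independently with probability p,
    # so the result has approximately (not exactly) `target` variants.
    return mt.sample_rows(p, seed=seed)
```

With 500,000 variants and a target of 5,000, `sampling_fraction` gives 0.01, so about 1% of the variants are kept. Random thinning is only one option; whether it is appropriate depends on the downstream relatedness method.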
How much memory does the computer or cluster have? If a cluster, how much memory per executor or VM do you have?
How do you start your job? Do you use ipython, python3, pyspark, spark-submit, or one of the cloud provider commands?
How many total cores did the 3 node cluster have? Hail unfortunately is only moderately fast on a single computer. Hail is valuable because it can scale up to hundreds or thousands of cores. Most of our users use Google Dataproc or Amazon EMR to briefly access very large clusters. In the cloud, we pay per core-hour, so 1000 cores for one hour costs the same as 10 cores for 100 hours. If you’d like to try Hail on the cloud, we have introductory material in the docs.
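The core-hour arithmetic above can be checked with a few lines of Python. This is only a sketch: the price per core-hour is a made-up placeholder, not a quote from any cloud provider, and real bills include other charges such as storage and egress.

```python
def core_hours(cores: int, hours: float) -> float:
    """Total core-hours consumed by a cluster of `cores` running for `hours`."""
    return cores * hours


def cluster_cost(cores: int, hours: float, price_per_core_hour: float) -> float:
    """Cost under a simple per-core-hour pricing model (hypothetical price)."""
    return core_hours(cores, hours) * price_per_core_hour


PRICE = 0.01  # hypothetical price per core-hour, for illustration only

big_and_short = cluster_cost(cores=1000, hours=1, price_per_core_hour=PRICE)
small_and_long = cluster_cost(cores=10, hours=100, price_per_core_hour=PRICE)

# Both runs consume 1000 core-hours, so they cost the same under this model.
print(big_and_short == small_and_long)
```

Under this model the only difference between the two runs is how long you wait for the answer, which assumes the job actually scales to 1000 cores.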
Can you link to the post that sets block_size to 75? I’d like to fix that post.
The different executables do not affect performance, but they affect how you set parameters. What is the output of:
This variable needs to specify how much memory is available to Hail. See this post for information on how to set that variable. Try setting both the executor memory and driver memory to the total amount of RAM on your computer.
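As a rough illustration, here is one way to set driver and executor memory from Python before Hail starts. The thread does not name the variable, so I am assuming it is `PYSPARK_SUBMIT_ARGS`, which PySpark reads when launching a local Spark session. The `8g` values and the helper name are placeholders; substitute the amount of RAM on your machine.

```python
import os


def spark_submit_args(driver_memory: str, executor_memory: str) -> str:
    """Build a PYSPARK_SUBMIT_ARGS value with explicit memory settings.

    The trailing "pyspark-shell" token is what PySpark expects at the end
    of this variable when it starts Spark from a Python process.
    """
    return (
        f"--driver-memory {driver_memory} "
        f"--executor-memory {executor_memory} "
        "pyspark-shell"
    )


# Assumption: the variable discussed above is PYSPARK_SUBMIT_ARGS.
# "8g" is a placeholder; use the total RAM of your computer instead.
# This must be set before Hail (and therefore Spark) is initialized.
os.environ["PYSPARK_SUBMIT_ARGS"] = spark_submit_args("8g", "8g")
```

Setting the variable after Spark has already started has no effect, so it belongs at the very top of the script or in the shell that launches it.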
That’s quite concerning! I’d really like to get to the bottom of that, if you don’t mind helping!
Can you share any other information about the dataset? What is the ancestral structure of the samples? How many of the samples have recent admixture? Is it possible for me to get access to it (or some appropriately sanitized version) for technical development purposes? What percent of genotypes are missing? What kinships do IBD and PC-Relate report for those sample-pairs?
I’m not too surprised about the timings. IBD uses a very naive model (it assumes a single homogeneous population). Regarding PC-Relate, to what did you set statistics? By default, Hail computes kinship, ibd-0, ibd-1, and ibd-2. I expect statistics='kin' to take about 2x the time of KING.
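For reference, a minimal sketch of asking PC-Relate for kinship only. It assumes `mt` is a MatrixTable with a `GT` call field; the wrapper name and the values of `k` and `min_individual_maf` are hypothetical and should be chosen for your data.

```python
def pc_relate_kinship_only(mt, k: int = 10, min_individual_maf: float = 0.01):
    """Run Hail's pc_relate computing only the kinship statistic.

    `mt` is assumed to be a Hail MatrixTable with a `GT` entry field.
    `k` (number of principal components) and `min_individual_maf` are
    illustrative defaults, not recommendations.
    """
    import hail as hl  # imported lazily so this module loads without Hail

    # statistics='kin' skips the ibd-0, ibd-1, and ibd-2 estimates that
    # the default, statistics='all', would also compute.
    return hl.pc_relate(
        mt.GT,
        min_individual_maf,
        k=k,
        statistics='kin',
    )
```

The result is a Hail Table keyed by sample pair with a `kin` field, which can be compared directly against the KING estimates for the same pairs.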
Thanks Dan! I figured it was easier to export the MatrixTable to VCF and then run the downstream analysis. I am sure this will become a constraint as the data grows in size, but I feel this is faster with the current dataset. Additionally, I have more control over which options are fed into KING or PC-Relate.