I am trying to estimate identity by descent on a Google Cloud Dataproc cluster, using a post-QC MatrixTable (multi-allelic variants were split, and variants and samples were both filtered to call rate > 0.95), with:
ht = hl.identity_by_descent(mt)
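For reference, this is roughly the pipeline that produces the MatrixTable (the bucket path is a placeholder; the QC steps are as described above):

```python
import hail as hl

hl.init()

# Placeholder path; the real input is our post-QC dataset
mt = hl.read_matrix_table('gs://my-bucket/genotypes.mt')

# Split multi-allelic variants into biallelic rows
mt = hl.split_multi_hts(mt)

# Keep variants with call rate > 0.95
mt = hl.variant_qc(mt)
mt = mt.filter_rows(mt.variant_qc.call_rate > 0.95)

# Keep samples with call rate > 0.95
mt = hl.sample_qc(mt)
mt = mt.filter_cols(mt.sample_qc.call_rate > 0.95)

# IBD estimation; this is the step that fails
ht = hl.identity_by_descent(mt)
```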
and I keep getting the following error:
Hail version: 0.2.11-adfb5ad12c3c
Error summary: SparkException: Job aborted due to stage failure: ShuffleMapStage 146 (map at IBD.scala:266) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Failure while fetching StreamChunkId{streamId=1663420509000, chunkIndex=0}: java.lang.RuntimeException: Failed to open file: /mnt/sdb/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1553470753653_0002/blockmgr-93e41ab8-8a87-4e02-8bd4-c5ba7e30e62f/32/shuffle_23_214_0.index
Thanks for your response. I tried with no preemptible nodes in the Dataproc cluster and still got an error:
Error summary: SparkException: Job aborted due to stage failure: Task 1 in stage 6.0 failed 20 times, most recent failure: Lost task 1.19 in stage 6.0 (TID 2523, art-cluster-w-0.c.daly-lab.internal, executor 39): ExecutorLostFailure (executor 39 exited caused by one of the running tasks) Reason: Container marked as failed: container_1553622353535_0001_01_000041 on host: art-cluster-w-0.c.daly-lab.internal. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal
I have attempted to run this several times and always get this error. As far as I can tell, exit code 137 means the container was killed by SIGKILL, which usually points to running out of memory.
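In case it is relevant, here is a sketch of how I could shrink the variant set before the IBD step (the MAF and r2 thresholds are placeholders, not values I have tested):

```python
# Restrict to common variants (placeholder MAF threshold)
mt_common = mt.filter_rows(
    hl.agg.call_stats(mt.GT, mt.alleles).AF[1] > 0.05)

# LD-prune to an approximately independent set of variants
pruned = hl.ld_prune(mt_common.GT, r2=0.2, bp_window_size=500000)
mt_pruned = mt_common.filter_rows(
    hl.is_defined(pruned[mt_common.row_key]))

ht = hl.identity_by_descent(mt_pruned)
```

Would filtering like this be a sensible workaround, or is there a cluster-side fix I should try first?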