I ran 'hl.experimental.run_combiner()' with a branch factor of 100 and a batch factor of 100 on 1000 WGS gVCFs in GCP. The run completed successfully, but some tasks failed along the way, which caused multiple retry attempts and a long runtime. I used the n1-highmem-32 machine type with spark.executor.memory = 72g and spark.executor.cores = 8. With these Spark conf parameters, each node launched up to 2 executors. Could you please give me some advice on how to eliminate these task failures? Thank you.
Round 1 (1000 gvcfs --> 10 MT) : 234 failed / 250 tasks
Error 1 : ExecutorLostFailure (executor 30 exited unrelated to the running tasks) Reason: Container marked as failed: container_1594240183345_0002_01_000045 on host: ****. Exit status: -100. Diagnostics: Container released on a lost node.
Round 2 (10 MT --> 1 MT) : 4567 failed / 95623 tasks
Error 2 : ExecutorLostFailure (executor 78 exited caused by one of the running tasks) Reason: Container from a bad node: container_1594240183345_0002_01_000124 on host: ****. Exit status: 134. Diagnostics: [2020-07-11 08:44:42.110]Exception from container-launch.
Error 3 : java.io.IOException: Error accessing gs://*****/metadata.json.gz
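
For reference, here is a rough sketch of why the settings above would yield at most 2 executors per node. It is only back-of-the-envelope arithmetic, not output from the cluster. The function name `executors_per_node` is made up for illustration. The 10% memory overhead (with a 384 MiB floor) is Spark's default for `spark.executor.memoryOverhead` on YARN, and the 0.8 fraction of node memory given to YARN is an assumption about the Dataproc defaults that I have not verified on this cluster.

```python
import math


def executors_per_node(node_mem_gb, node_vcpus, executor_mem_gb, executor_cores,
                       yarn_fraction=0.8, overhead_fraction=0.10):
    """Estimate how many Spark executors YARN can place on one worker node.

    yarn_fraction is an ASSUMED share of node memory available to YARN
    containers; overhead_fraction mirrors Spark's default executor memory
    overhead (10% of executor memory, at least 384 MiB).
    """
    # Each container needs the executor heap plus the off-heap overhead.
    overhead_gb = max(0.384, executor_mem_gb * overhead_fraction)
    container_gb = executor_mem_gb + overhead_gb

    # Limit imposed by memory and limit imposed by cores; the smaller one wins.
    by_memory = math.floor(node_mem_gb * yarn_fraction / container_gb)
    by_cores = node_vcpus // executor_cores
    return min(by_memory, by_cores)


# n1-highmem-32: 32 vCPUs and 208 GB of memory.
print(executors_per_node(node_mem_gb=208, node_vcpus=32,
                         executor_mem_gb=72, executor_cores=8))
```

Under these assumptions memory is the limiting factor: the cores would allow 4 executors per node, but only 2 containers of roughly 79 GB fit, so about half of the vCPUs on each node are not used by executors.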