Stuck at the last steps of a stage


We upgraded to the latest version (0.2.26) because the gnomAD v3 table is only compatible with 0.2.24 and above. Now the job gets stuck near the end of a stage, e.g. Stage 3: ....(9373 + 5) / 9378]. The hang is reproducible; on 0.2.24 it stops at Stage 2 instead, but again right at the end.

I looked at the logs and couldn’t find any errors, only messages about lost workers, but I might not know what to look for.

Any help would be much appreciated.

If you look at the Spark UI, does it show 5 tasks remaining in progress?

The thing to do here is probably turn on Spark speculation.

I looked through the Spark UI a bit but couldn’t really decipher it. Is turning on Spark speculation just setting spark.speculation to true? I’m not sure the tasks are just running slowly: even when we leave the job running for 30 minutes, they invariably stay stuck.

The stage / task view in the UI might help indicate which tasks are the problem, or if this is a UI bug and something else is to blame.

And yep, spark.speculation=true is the right setting.
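For anyone unsure where that setting goes, here is a minimal sketch of two common ways to hand a Spark property to a job: as `--conf` arguments to spark-submit, or as a Dataproc `--properties` value with the `spark:` prefix. The helper names are hypothetical and only the key spark.speculation comes from this thread; how your own launcher forwards these options is an assumption you should check.

```python
# Hypothetical helpers for illustration; only the property key
# spark.speculation is taken from this thread.
SPECULATION_CONF = {"spark.speculation": "true"}


def to_spark_submit_args(conf):
    """Render a config dict as repeated `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


def to_dataproc_properties(conf):
    """Render a config dict as a Dataproc `--properties` value.

    Dataproc expects Spark properties to carry a `spark:` prefix.
    """
    return ",".join(f"spark:{key}={conf[key]}" for key in sorted(conf))


if __name__ == "__main__":
    print(" ".join(to_spark_submit_args(SPECULATION_CONF)))
    print(to_dataproc_properties(SPECULATION_CONF))
```

The second string is the form accepted by the `--properties` option of `gcloud dataproc clusters create`; as far as I know `hailctl dataproc start` forwards a `--properties` option in the same format, but treat that as an assumption and check the help output of your version.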

I’ll check that out soon, but here’s something interesting. I’m running the same pipeline on 0.2.22 against the gnomAD v2 table and it seems to be progressing to stage 7 (not sure how the stages differ across versions). So the VCF doesn’t look to be the cause. Is there anything else we can look into?

Hmm, weird. What’s the pipeline?

It’s our seqr pipeline that joins the VCF variants with a bunch of annotations from a reference data set.

It looks like setting speculation to true allows it to make progress.

The Spark docs describe the setting like this: If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.
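To make "running slowly" concrete, here is a simplified sketch of the kind of rule Spark applies: once enough tasks in a stage have finished, any running task that has taken much longer than the median finished task becomes a candidate for a second copy. This is not Spark's actual code. The function name and the example durations are hypothetical, and the default arguments mirror the values documented for Spark 2.4 (spark.speculation.quantile = 0.75, spark.speculation.multiplier = 1.5); other Spark versions may use different defaults.

```python
from statistics import median


def speculatable_tasks(finished_durations, running_elapsed, total_tasks,
                       quantile=0.75, multiplier=1.5):
    """Return ids of running tasks that a speculation rule would re-launch.

    finished_durations: durations (seconds) of tasks that already succeeded
    running_elapsed:    {task_id: seconds elapsed so far} for running tasks
    total_tasks:        number of tasks in the stage
    """
    # Speculation only starts after a fraction of the stage has finished.
    min_finished = max(int(quantile * total_tasks), 1)
    if len(finished_durations) < min_finished:
        return []

    # A task counts as slow when it has run longer than
    # multiplier * median duration of the finished tasks.
    threshold = multiplier * median(finished_durations)
    return sorted(task_id for task_id, elapsed in running_elapsed.items()
                  if elapsed > threshold)


if __name__ == "__main__":
    # Shape of the stage from this thread: 9373 done, 5 stuck, 9378 total.
    # The durations themselves are made up for illustration.
    finished = [60.0] * 9373
    running = {task_id: 1800.0 for task_id in range(9373, 9378)}
    print(speculatable_tasks(finished, running, total_tasks=9378))
```

With speculation on, each flagged task gets a second attempt, typically on a different executor, and whichever attempt finishes first is used. That would explain progress if the original attempts were stuck on an unhealthy executor.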

Does the slow task hit a bug or something and need to be restarted? Why does that fix it?

This is something we’ve never been able to characterize. Performance should be reproducible, but clearly there are cases where it’s not.

Maybe a Spark executor gets into a bad state, thrashing memory, maybe something else…

Should that setting go into the general hailctl defaults, or does it affect performance?

I think we’re planning on putting that as a hailctl default, yeah.
