I was running our Hail-based sample QC script (attached as hail_sample_qc_MG_YL_101423_Step1.txt) on the ~470,000 exomes on the UK Biobank Research Analysis Platform. The job had almost reached the end of the script when it appeared to stall on the very last task of Stage 65. I let it run for roughly 3 hours after the task first appeared to stall, and it still did not finish. During that time, nothing was written to the hail.log file that indicated an error or warning. For context, the job seemed to get stuck on line 285 or 287 of the script.
I have attached some screenshots from the Spark UI that might shed light on the issue. Was the job still running as intended? Should I expect this kind of stalling when running hl.variant_qc on this many exomes?
As you can see, some tasks in earlier stages failed, but they appear to have restarted and completed successfully: