UK Biobank RAP - Generating Summary File from Variant QC


I am reaching out regarding an error when saving a summary table while using Hail on the UK Biobank Research Analysis Platform. I have contacted the UKB RAP team about this issue, but have not received a conclusive solution.

The summary table is generated following a set of Variant QC tests on WES data. The error occurs even when working with a subset of only 80 participants from the total 470k dataset.

In most cases, the JupyterLab environment automatically closes and a new environment is relaunched. One common error is “The machine running the job became unresponsive”. Another is “The machine running the job was terminated by the cloud provider”. When the environment does not close, I receive messages such as:

Error summary: SparkException: Job aborted due to stage failure: Task 1522 in stage 39.0 failed 4 times, most recent failure: Lost task 1522.3 in stage 39.0 (TID 131756) (ip-10-60-33-203.eu-west-2.compute.internal executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 132934 ms
Error summary: SparkException: Job aborted due to stage failure: Task 0 in stage 40.0 failed 4 times, most recent failure: Lost task 0.3 in stage 40.0 (TID 141408) (ip-10-60-13-15.eu-west-2.compute.internal executor 40): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: worker lost

When I remove this summary component of the script, the prior portion of the script completes successfully without any noted errors.

We would greatly appreciate any insight you can offer regarding this issue. I have included our summary code below:

                'LCR': hl.agg.filter(vcf.filters.contains("LCR"), vcf_aggregation),
                #'VQSR': hl.agg.filter((vcf.filters.intersection(hl.literal({f for f in vcf_header if f.startswith('VQSRTranche')})).length() > 0), vcf_aggregation),
                'LowQual': hl.agg.filter(vcf.filters.contains("LowQual"), vcf_aggregation),
                #'InbreedingCoeff': hl.agg.filter(vcf.filters.contains("InbreedingCoeff"), vcf_aggregation),
                'Mono': hl.agg.filter(vcf.filters.contains("Mono"), vcf_aggregation),
                'Missingness': hl.agg.filter(vcf.filters.contains("Missingness"), vcf_aggregation),
                'PCR': hl.agg.filter(vcf.filters.contains("PCR"), vcf_aggregation),
                'SEQ': hl.agg.filter(vcf.filters.contains("SEQ"), vcf_aggregation),
                'LCSET': hl.agg.filter(vcf.filters.contains("LCSET"), vcf_aggregation),
                #'HMISS': hl.agg.filter(vcf.filters.contains("HMISS"), vcf_aggregation),
                'PASS': hl.agg.filter(hl.len(vcf.filters) == 0, vcf_aggregation),
                # One entry per population for each HWE filter (overall, female, male).
                **{'HWE{}'.format(pop): hl.agg.filter(vcf.filters.contains("HWE{}".format(pop)), vcf_aggregation)
                   for pop in populations},
                **{'HWE{}Female'.format(pop): hl.agg.filter(vcf.filters.contains("HWE{}Female".format(pop)), vcf_aggregation)
                   for pop in populations},
                **{'HWE{}Male'.format(pop): hl.agg.filter(vcf.filters.contains("HWE{}Male".format(pop)), vcf_aggregation)
                   for pop in populations},

print("Checkpoint 10")
output_summary(summary, args.prefix+"_var_qc.tsv.bgz")

Usually, I hit this error when a worker runs out of memory. I cannot say which part of the code causes it, but if you just want to get on with your life, you can try opting for a beefier instance.

Hello @jsmadsen - thank you for your response. In prior attempts, I have tried increasing my worker memory up to 128GB per worker. From your experience, might this not be enough even for a test set of 80 WES samples? And should I not expect memory demand to scale linearly when moving to all 470,000 WES samples? Thanks for the help!

Apologies for the late reply, I hope you figured it out. No, that should be more than enough memory for so few samples. I suspect it is still a memory thing, so something in that code may be blowing up.

In general, Hail likes a checkpoint after you have completed your major filtering, so if you are not doing that already, it might at least speed up troubleshooting.
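
A minimal sketch of what that checkpoint could look like, assuming `vcf` is a Hail MatrixTable or Table (both have a `checkpoint` method). The helper name and the path in the usage comment are hypothetical, not taken from the original script.

```python
def checkpoint_before_summary(vcf, checkpoint_path):
    """Write the filtered dataset to disk and return the copy read back from it.

    `vcf` is assumed to be a Hail MatrixTable or Table.
    `checkpoint_path` is a hypothetical output location chosen by the caller.
    """
    # checkpoint() writes the data and returns a dataset backed by those files,
    # so later aggregations start from disk instead of re-running the whole
    # QC pipeline every time the summary is recomputed.
    return vcf.checkpoint(checkpoint_path, overwrite=True)


# Hypothetical usage, placed just before the summary is built:
# vcf = checkpoint_before_summary(vcf, args.prefix + "_post_qc.mt")
```

If the job already fails while writing the checkpoint, the problem is upstream of the summary code, which narrows the search considerably.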

You can also replace /lab... in the URL with 8081 to bring up the Spark monitoring UI, in case that tells you something.

My naive approach would be to comment out lines and see if a single one of them causes the issue.
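
Instead of commenting lines out by hand, the same idea can be sketched as a loop that runs each aggregation on its own. This assumes the summary entries are available as a dict of name to aggregation expression; the function name and the `run` callable are illustrative, not part of the original script.

```python
def run_one_at_a_time(aggregators, run):
    """Run each named aggregation separately and report which ones fail.

    `aggregators` maps a name (e.g. 'LCR') to an aggregation expression.
    `run` executes a single expression; with Hail this could be
    `lambda expr: vcf.aggregate_rows(expr)` for a MatrixTable, or
    `lambda expr: vcf.aggregate(expr)` for a Table (an assumption about `vcf`).
    """
    results, failures = {}, {}
    for name, expr in aggregators.items():
        # Print before running: if the whole environment dies without raising,
        # the last name printed still identifies the suspect aggregation.
        print(f"running {name}", flush=True)
        try:
            results[name] = run(expr)
        except Exception as exc:
            failures[name] = repr(exc)
    return results, failures
```

This is slower than one combined aggregation because the data is scanned once per entry, so it is only meant for troubleshooting on the small test subset.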

For actual Hail-wizards to help, they generally need the stack trace (not just error message) to figure out what is going on.
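
One way to capture the full stack trace, sketched with the standard library only. The helper name and the default log file name are hypothetical; the usage comment assumes the `output_summary` call from the original post.

```python
import traceback


def run_and_save_traceback(step, log_path="summary_traceback.txt"):
    """Call `step()`; on failure, save the full stack trace and re-raise."""
    try:
        return step()
    except Exception:
        # format_exc() returns the whole trace, not just the final error line.
        trace = traceback.format_exc()
        with open(log_path, "w") as handle:
            handle.write(trace)
        print(trace)
        raise


# Hypothetical usage around the failing call:
# run_and_save_traceback(
#     lambda: output_summary(summary, args.prefix + "_var_qc.tsv.bgz"))
```

This only helps when the Python process survives long enough to see the exception. If the whole environment is terminated, nothing will be written.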