Hello,
I am reaching out regarding an error when saving a summary table while using Hail on the UK Biobank Research Analysis Platform. I have contacted the UKB RAP team about this issue, but have not yet received a conclusive solution.
The summary table is generated after a set of variant QC tests on WES data. The error occurs even when working with a subset of only 80 participants from the full ~470k-participant dataset.
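For context, the subsetting step is roughly as sketched below; the file path and sample IDs are placeholders rather than our actual values, so please treat this only as an illustration of the shape of the pipeline:

import hail as hl

# Placeholder path and sample list -- not the actual RAP values.
vcf = hl.read_matrix_table("file:///path/to/wes_subset.mt")
sample_subset = ["sample_1", "sample_2"]  # in practice, the 80 participant IDs

# Restrict the MatrixTable columns to the test subset of participants.
subset_ids = hl.literal(set(sample_subset))
vcf = vcf.filter_cols(subset_ids.contains(vcf.s))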
In most cases, the JupyterLab environment automatically closes and a new environment is relaunched. One common error reads “The machine running the job became unresponsive”. Another reads “The machine running the job was terminated by the cloud provider”. When the environment does not close, I receive error messages such as:
Error summary: SparkException: Job aborted due to stage failure: Task 1522 in stage 39.0 failed 4 times, most recent failure: Lost task 1522.3 in stage 39.0 (TID 131756) (ip-10-60-33-203.eu-west-2.compute.internal executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 132934 ms
or
Error summary: SparkException: Job aborted due to stage failure: Task 0 in stage 40.0 failed 4 times, most recent failure: Lost task 0.3 in stage 40.0 (TID 141408) (ip-10-60-13-15.eu-west-2.compute.internal executor 40): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: worker lost
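In case it is relevant, we have wondered whether raising the Spark heartbeat and network timeouts when initializing Hail would help, but we have not confirmed this; the values below are only illustrative guesses, not settings we currently use (the Spark defaults are spark.executor.heartbeatInterval=10s and spark.network.timeout=120s):

import pyspark
import hail as hl

# Illustrative timeout values only -- not validated on the RAP.
conf = pyspark.SparkConf().setAll([
    ("spark.executor.heartbeatInterval", "60s"),
    ("spark.network.timeout", "600s"),
])
sc = pyspark.SparkContext(conf=conf)
hl.init(sc=sc, default_reference="GRCh38")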
When I remove this summary component from the script, the preceding portion of the script completes successfully without any errors.
We would greatly appreciate any insight you can offer regarding this issue. I have included our summary code below:
# Per-filter counts: apply the aggregation defined earlier in the script
# (vcf_aggregation) to the rows carrying each FILTER flag, plus PASS
# (empty filter set).
summary.append(
    vcf.aggregate_rows(
        hl.struct(
            **{
                'LCR': hl.agg.filter(vcf.filters.contains("LCR"), vcf_aggregation),
                #'VQSR': hl.agg.filter((vcf.filters.intersection(hl.literal({f for f in vcf_header if f.startswith('VQSRTranche')})).length() > 0), vcf_aggregation),
                'LowQual': hl.agg.filter(vcf.filters.contains("LowQual"), vcf_aggregation),
                #'InbreedingCoeff': hl.agg.filter(vcf.filters.contains("InbreedingCoeff"), vcf_aggregation),
                'Mono': hl.agg.filter(vcf.filters.contains("Mono"), vcf_aggregation),
                'Missingness': hl.agg.filter(vcf.filters.contains("Missingness"), vcf_aggregation),
                'PCR': hl.agg.filter(vcf.filters.contains("PCR"), vcf_aggregation),
                'SEQ': hl.agg.filter(vcf.filters.contains("SEQ"), vcf_aggregation),
                'LCSET': hl.agg.filter(vcf.filters.contains("LCSET"), vcf_aggregation),
                #'HMISS': hl.agg.filter(vcf.filters.contains("HMISS"), vcf_aggregation),
                'PASS': hl.agg.filter(hl.len(vcf.filters) == 0, vcf_aggregation)
            },
            # Population-specific Hardy-Weinberg filters: all, female-only, male-only.
            **{
                'HWE{}'.format(pop): hl.agg.filter(vcf.filters.contains("HWE{}".format(pop)), vcf_aggregation)
                for pop in populations
            },
            **{
                'HWE{}Female'.format(pop): hl.agg.filter(vcf.filters.contains("HWE{}Female".format(pop)), vcf_aggregation)
                for pop in populations
            },
            **{
                'HWE{}Male'.format(pop): hl.agg.filter(vcf.filters.contains("HWE{}Male".format(pop)), vcf_aggregation)
                for pop in populations
            },
        )
    )
)
print("Checkpoint 10")
output_summary(summary, args.prefix+"_var_qc.tsv.bgz")
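One workaround we are considering, but have not yet verified, is checkpointing the annotated rows table to disk before aggregating, so the filter annotations are not recomputed inside the aggregation itself. A rough sketch is below; the output path is a placeholder and hl.agg.count_where stands in for our actual vcf_aggregation (which, since it is used in aggregate_rows, only touches row fields):

# Write the row-level filter annotations to disk, then aggregate from the
# checkpointed table rather than from the full MatrixTable pipeline.
rows = vcf.rows().select('filters')
rows = rows.checkpoint("file:///tmp/var_qc_rows.ht", overwrite=True)  # placeholder path

# Simplified stand-in for vcf_aggregation: count PASS variants.
pass_count = rows.aggregate(hl.agg.count_where(hl.len(rows.filters) == 0))
print("PASS variants:", pass_count)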