Hello,
I’m hoping to get some help debugging a fatal error I hit while running Query-on-Batch in a Jupyter notebook on my laptop. The full traceback and error dump are in the attached txt file. Briefly, my workflow consists of:
- Read a matrix table on GCP
- Read a local tsv into pandas and make a hail dict from it
- Aggressively downsample rows and columns in matrix table
- Annotate columns of matrix table from hail dict
- Annotate rows with an aggregate grouped by the new column annotation
- Get row table
- Format and export row table
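For reference, the steps above correspond roughly to the sketch below. It is only an illustration of the shape of the workflow, not the actual notebook code: the function name, the paths, the TSV column names (`sample_id`, `group`), the column key `s`, the sampling fractions, and the choice of `hl.agg.count()` as the grouped aggregate are all placeholders.

```python
def run_workflow(mt_path, tsv_path, out_path, row_frac=0.001, col_frac=0.01):
    """Sketch of the seven steps; every name and default here is a placeholder."""
    # Imported lazily so the function can be defined without Hail installed.
    import hail as hl
    import pandas as pd

    # Assumption: Query-on-Batch is selected through the 'batch' backend.
    hl.init(backend="batch")

    # 1. Read a matrix table stored on GCP (e.g. a gs:// path).
    mt = hl.read_matrix_table(mt_path)

    # 2. Read a local TSV into pandas and turn it into a Hail dict.
    #    Hypothetical columns: 'sample_id' and 'group'.
    df = pd.read_csv(tsv_path, sep="\t")
    groups = hl.literal(dict(zip(df["sample_id"], df["group"])))

    # 3. Aggressively downsample rows and columns.
    mt = mt.sample_rows(row_frac, seed=0)
    mt = mt.sample_cols(col_frac, seed=0)

    # 4. Annotate columns from the dict (assumes the column key field is 's').
    mt = mt.annotate_cols(group=groups.get(mt.s))

    # 5. Annotate rows with an aggregate grouped by the new column annotation.
    mt = mt.annotate_rows(n_by_group=hl.agg.group_by(mt.group, hl.agg.count()))

    # 6. Get the row table.
    ht = mt.rows()

    # 7. Keep only the fields of interest and export.
    #    Nothing above is computed until this call, since Hail evaluates lazily.
    ht = ht.select("n_by_group")
    ht.export(out_path)
```

In the versions that worked, step 3 would use `mt.head(...)` in place of the two sampling calls, and step 7 would call `ht.to_pandas()` in place of `ht.export(...)`.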
The error traceback points to the final export, but judging from run times I’m guessing everything here is lazily evaluated and nothing actually runs until the export. Earlier tests of this workflow worked fine when I (a) used head instead of sample_rows/sample_cols at step 3 (the downsampling), and (b) sent the row table to to_pandas instead of export at step 7 (the export). Running with sample_rows/sample_cols and to_pandas also hit a similar-looking fatal error, but I didn’t save that error message.
On the batch side, job 2077741 is the culprit with 2 failing jobs (of 22925), but I don’t know enough to parse those Java errors. From the submission behavior it looked like job 2077740 is also related, as the parent/control job. (I’m assuming you already have access to the batch logs from your end, but I’m happy to grab those outputs if needed.) If you want to look at the version that failed with to_pandas instead of export, I think that was jobs 2077734 (parent) / 2077736 (3 failed of 22k).
Let me know what additional info would be useful, if any. Thanks in advance for the help!
query_batch_dump.txt (29.9 KB)