Fatal error at row table export after sample_rows/cols with Query-on-Batch in jupyter

rkwalters · April 20, 2022, 8:04pm

Hello,
I’m hoping to get some help with debugging a fatal error I got while running Query-on-Batch in a jupyter notebook on my laptop. The full traceback + error dump is in the attached txt file. Briefly, my workflow consists of:

1. Read a matrix table on GCP
2. Read a local tsv into pandas and make hail dict
3. Aggressively downsample rows and columns in matrix table
4. Annotate columns of matrix table from hail dict
5. Annotate rows with an aggregate grouped by the new column annotation
6. Get row table
7. Format and export row table
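
For readers trying to reproduce this, the steps above can be sketched roughly as follows. This is not the original notebook code. The paths, the sampling fractions, the TSV column names (`s`, `group`), the column key `s`, the entry field `GT`, and the choice of aggregate are all assumptions made for illustration. The standard-library `csv` module stands in for pandas to keep the sketch dependency-light, and the Query-on-Batch backend is assumed to be configured already.

```python
import csv


def load_sample_groups(tsv_path, key_col="s", value_col="group"):
    """Read a local TSV into a plain {sample_id: group} dict.

    Stand-in for the pandas step; the column names are hypothetical.
    """
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[key_col]: row[value_col] for row in reader}


def export_grouped_row_stats(mt_path, tsv_path, out_path,
                             row_frac=0.001, col_frac=0.01):
    """Sketch of steps 1-7; every Hail operation is lazy until export."""
    import hail as hl  # imported here so the helper above works without Hail

    # Step 1: read the matrix table (e.g. a gs:// path on GCP).
    mt = hl.read_matrix_table(mt_path)

    # Step 2: turn the local mapping into a Hail dict expression.
    groups = hl.literal(load_sample_groups(tsv_path))

    # Step 3: aggressively downsample rows and columns.
    mt = mt.sample_rows(row_frac, seed=0)
    mt = mt.sample_cols(col_frac, seed=0)

    # Step 4: annotate columns from the dict (assumes the column key is `s`).
    mt = mt.annotate_cols(group=groups.get(mt.s))

    # Step 5: per-row aggregate grouped by the new column annotation
    # (assumes an entry field named `GT`; the aggregate is illustrative).
    mt = mt.annotate_rows(
        n_called_by_group=hl.agg.group_by(
            mt.group, hl.agg.count_where(hl.is_defined(mt.GT))
        )
    )

    # Steps 6-7: take the row table, keep the new field, and export.
    ht = mt.rows()
    ht = ht.select("n_called_by_group")
    ht.export(out_path)
```

Swapping `sample_rows`/`sample_cols` for `head`, or `export` for `to_pandas`, gives the other variants mentioned in this post.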

The error traceback points to the final export, but I’m guessing from run times etc. that everything here is waiting on lazy evals until hitting the export. Previous testing with this workflow had been working fine with (a) using head instead of sample_rows/cols at step 3, and (b) sending the row table to_pandas instead of export at step 7. Trying to run with sample_rows/cols and to_pandas also hit a similar looking fatal error, but I didn’t save that error message.

On the batch side, job 2077741 is the culprit with 2 failing jobs (of 22925), but I don’t know enough to parse those Java errors. From submission behavior it looked like job 2077740 is also related as the parent/control job. (I’m assuming you already have access to the batch logs from your end, but happy to grab those outputs if needed.) If you want to look at the version that failed with to_pandas instead of export, I think that was jobs 2077734 (parent) / 2077736 (3 failed of 22k).

Let me know what additional info would be useful, if any. Thanks in advance for the help!

query_batch_dump.txt (29.9 KB)

tpoterba · April 20, 2022, 8:07pm

I believe this is the same as a known issue with encodings that I’m looking into. I’ll have an update by Monday, I hope.

rkwalters · April 28, 2022, 5:07pm

Hi Tim, thanks for looking into this! If this is indeed related to your known issue, any suggestions on ways to potentially work around this for now?

tpoterba · April 28, 2022, 6:25pm

Hi Raymond, sorry I didn’t update as promised! The fix is actually merged and if that’s the problem, version 0.2.94 will work.
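
A quick way to confirm that the installed client is at least 0.2.94 before resubmitting might look like the sketch below. The helper names are hypothetical, and it assumes the package is installed under the distribution name `hail`. A version string with a build suffix after a hyphen is handled by dropping the suffix.

```python
from importlib import metadata


def parse_version(version):
    """Turn '0.2.94' or '0.2.94-abcdef' into a tuple of ints like (0, 2, 94)."""
    core = version.split("-")[0]
    return tuple(int(part) for part in core.split(".") if part.isdigit())


def has_fix(installed, minimum="0.2.94"):
    """True if the installed version is at least the release with the fix."""
    return parse_version(installed) >= parse_version(minimum)


def installed_hail_version():
    """Return the installed version of the `hail` package, or None if absent."""
    try:
        return metadata.version("hail")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_hail_version()
    if found is None:
        print("hail is not installed in this environment")
    else:
        print(found, "has the fix" if has_fix(found) else "is older than 0.2.94")
```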

rkwalters · April 28, 2022, 10:19pm

Thanks for the update! I still get a fatal error with 0.2.94 (log attached), but the error log is longer with a bit more structured content so maybe that’s progress?

Batch job IDs are 2334458 (parent) and 2334493 (22k tasks), and this time all the child jobs report success but the parent job says failure. The log for that parent job blames a StepInterruptedError causing a ContainerDeletedError.

query_batch_dump2.txt (691.0 KB)

tpoterba · April 29, 2022, 10:13am

@jigold could you take a look here? There’s nothing that looks like a stack trace from the compiler in this error dump.

tpoterba · April 29, 2022, 12:03pm

Raymond, one question that came up – is this replicable every time you submit?

rkwalters · May 4, 2022, 8:58pm

Appears replicable, at least in the Python traceback, without going through the full dump with a fine-tooth comb. That’s only from a couple of attempts though. I hit the end of my trial billing credit, so I’ll need to swap things over before I can test further.
