Fatal error at row table export after sample_rows/cols with Query-on-Batch in jupyter

Hello,
I’m hoping to get some help debugging a fatal error I hit while running Query-on-Batch in a Jupyter notebook on my laptop. The full traceback and error dump are in the attached txt file. Briefly, my workflow consists of:

  1. Read a matrix table on GCP
  2. Read a local tsv into pandas and make hail dict
  3. Aggressively downsample rows and columns in matrix table
  4. Annotate columns of matrix table from hail dict
  5. Annotate rows with an aggregate grouped by the new column annotation
  6. Get row table
  7. Format and export row table
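
For readers trying to reproduce this, the seven steps above might look roughly like the sketch below. This is not the original notebook code. The paths, the sampling fractions, the TSV column names (`sample_id`, `group`), the column key `s`, the `GT` entry field, and the choice of aggregation are all assumptions made for illustration, and the Query-on-Batch backend is assumed to be configured already.

```python
def run_workflow(mt_path, tsv_path, out_path, row_frac=0.001, col_frac=0.01):
    """Hypothetical reconstruction of steps 1-7; every field name is an assumption."""
    # Imported here so the sketch can be loaded without Hail or pandas installed.
    import hail as hl
    import pandas as pd

    # 1. Read a matrix table on GCP (e.g. a gs:// path).
    mt = hl.read_matrix_table(mt_path)

    # 2. Read a local TSV into pandas and build a Hail dict from it.
    #    Column names "sample_id" and "group" are placeholders.
    df = pd.read_csv(tsv_path, sep="\t")
    groups = hl.literal(dict(zip(df["sample_id"], df["group"])))

    # 3. Aggressively downsample rows and columns.
    mt = mt.sample_rows(row_frac, seed=0)
    mt = mt.sample_cols(col_frac, seed=0)

    # 4. Annotate columns from the dict (assumes the column key is `s`).
    mt = mt.annotate_cols(group=groups.get(mt.s))

    # 5. Annotate rows with an aggregate grouped by the new column annotation
    #    (assumes a `GT` call field; the real aggregate may differ).
    mt = mt.annotate_rows(
        n_alt_by_group=hl.agg.group_by(mt.group, hl.agg.sum(mt.GT.n_alt_alleles()))
    )

    # 6. Get the row table.
    ht = mt.rows()

    # 7. Format (here: serialize the per-group dict as JSON) and export.
    ht = ht.select(n_alt_by_group=hl.json(ht.n_alt_by_group))
    ht.export(out_path)
```

Nothing in this sketch executes until `ht.export(...)` is called, which would be consistent with the traceback pointing at the final step.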

The error traceback points to the final export, but I’m guessing from run times etc. that everything here is lazily evaluated and nothing actually runs until the export. In earlier testing this workflow ran fine when I (a) used head instead of sample_rows/cols at step 3 and (b) sent the row table to to_pandas instead of export at step 7. Running with sample_rows/cols and to_pandas also hit a similar-looking fatal error, but I didn’t save that error message.

On the batch side, job 2077741 is the culprit with 2 failing jobs (of 22925), but I don’t know enough to parse those Java errors. From submission behavior it looked like job 2077740 is also related as the parent/control job. (I’m assuming you already have access to the batch logs from your end, but I’m happy to grab those outputs if needed.) If you want to look at the version that failed with to_pandas instead of export, I think that was jobs 2077734 (parent) / 2077736 (3 failed of 22k).

Let me know what additional info would be useful, if any. Thanks in advance for the help!

query_batch_dump.txt (29.9 KB)

I believe this is the same as a known issue with encodings that I’m looking into. I’ll have an update by Monday, I hope.

Hi Tim, thanks for looking into this! If this is indeed related to your known issue, any suggestions on ways to potentially work around this for now?

Hi Raymond, sorry I didn’t update as promised! The fix is actually merged and if that’s the problem, version 0.2.94 will work.
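
For anyone following along, one way to confirm that the notebook environment really has 0.2.94 or later is a small check like the one below. It is only a sketch: `parse_version` and `has_fix` are helper names made up for this example, and it assumes Hail was installed with pip under the package name `hail`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(v):
    """Turn '0.2.94' or '0.2.94-<hash>' into a comparable tuple like (0, 2, 94)."""
    core = v.split("-")[0].split("+")[0]
    return tuple(int(part) for part in core.split("."))


def has_fix(installed, minimum="0.2.94"):
    """True if the installed version is at least the release mentioned above."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    try:
        installed = version("hail")  # assumes a pip-installed `hail` package
        print(installed, "has fix:", has_fix(installed))
    except PackageNotFoundError:
        print("hail is not installed in this environment")
```

Comparing tuples rather than strings matters here, since a plain string comparison would rank "0.2.100" below "0.2.94".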

Thanks for the update! I still get a fatal error with 0.2.94 (log attached), but the error log is longer with a bit more structured content so maybe that’s progress?

Batch job IDs are 2334458 (parent) and 2334493 (22k tasks), and this time all the child jobs report success but the parent job reports failure. The log for that parent job blames a StepInterruptedError causing a ContainerDeletedError.

query_batch_dump2.txt (691.0 KB)

@jigold could you take a look here? There’s nothing that looks like a stack trace from the compiler in this error dump.

Raymond, one question that came up – is this replicable every time you submit?

It appears replicable, at least going by the Python traceback; I haven’t gone through the full dump with a fine-tooth comb. That’s only from a couple of attempts, though. I hit the end of my trial billing credit, so I’ll need to swap things over before I can test further.