Connection Errors

Hello,
Hail version: 0.2.107-2387bb00ceee

We have had a persistent issue with Hail within the All of Us environment. This is not a code-specific issue, as essentially any bit of code can trigger this error at times while working fine at others. I'm trying to understand the cause of the error and potential solutions. Is this a Hail issue or an All of Us environment issue? The recommendation from support has been to increase memory in the worker nodes, but that has not been successful.

# Simple example of code that can replicate this issue
# The matrix table contains a 20kb region for 245k samples
import hail as hl

def split_multi_allelic(mt):
    # Keep biallelic rows as-is, split only the multiallelic rows,
    # then union the two back together.
    bi = mt.filter_rows(hl.len(mt.alleles) == 2)
    bi = bi.annotate_rows(a_index=1, was_split=False, old_locus=bi.locus, old_alleles=bi.alleles)
    multi = mt.filter_rows(hl.len(mt.alleles) > 2)
    split = hl.split_multi(multi)
    mt = split.union_rows(bi)
    return mt

mt = hl.read_matrix_table(mt_bucket)       # mt_bucket: source matrix table path
mt = split_multi_allelic(mt)
mt.write(mt_bucket_new, overwrite=True)    # mt_bucket_new: destination path


Thanks

Hi @anh151 !

In the future, please just copy and paste the message into a code block (you can use three backticks ``` to create a code block). Please also include the full stack trace. If either of these is too big to share here, you can create a gist at https://gist.github.com and paste it there.

You probably want hl.split_multi_hts, not hl.split_multi, unless you've already investigated both and decided on the latter.
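For context, here's a minimal sketch of the practical difference (this assumes the standard HTS entry schema of GT/AD/DP/GQ/PL; mt_path is just a placeholder):

import hail as hl

mt = hl.read_matrix_table(mt_path)  # mt_path is a placeholder

# split_multi_hts splits multiallelic rows and also downcodes the standard
# HTS entry fields; biallelic rows pass through with was_split=False.
mt_hts = hl.split_multi_hts(mt)

# split_multi only splits rows and adds a_index/was_split; updating the
# entry fields is then up to you, for example:
mt_raw = hl.split_multi(mt)
mt_raw = mt_raw.annotate_entries(GT=hl.downcode(mt_raw.GT, mt_raw.a_index))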


The error message is a bit confusing. The Py4JNetworkError is just about how Python talks to Java; it's not a real network issue. All the interesting information should come after the "An error occurred while calling o1.executeEncode". In particular, I would expect a Java stack trace. There is also useful information in the Hail log file; can you copy that out and share it here?
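If it's easier, something like this should find the most recent Hail log from inside the notebook and copy it somewhere shareable (a sketch, assuming the default log location in the working directory; the destination bucket is a placeholder):

import glob
import os

import hail as hl

# Hail writes a log named hail-YYYYMMDD-HHMM-<version>.log to the working
# directory by default; pick the most recently modified one.
log_path = max(glob.glob('hail-*.log'), key=os.path.getmtime)
print(log_path)

# Copy it out to a bucket so it can be shared (placeholder destination path).
hl.hadoop_copy('file://' + os.path.abspath(log_path), 'gs://my-bucket/' + log_path)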

My guess is that this is not a memory issue because split_multi doesn’t use much memory. I can’t comment further without the other debugging information I requested above.

Hi Dan.
You're right. I was using split_multi_hts; I think I just copied something wrong. Sorry about not including the full stack trace; this was originally part of an email to our group to try to debug amongst ourselves.

A full stack trace example for the same 20kb region and 245k samples is below. I don't see any Java traceback on my end. The Hail log file is also attached. The environment is 2 regular workers and 300 preemptible workers, each with 4 CPUs and 15 GB of RAM.

import hail as hl

# mt_two and mt_three are bucket paths defined elsewhere in the notebook.
def split_multi_allelic(mt):
    bi = mt.filter_rows(hl.len(mt.alleles) == 2)
    bi = bi.annotate_rows(a_index=1, was_split=False)
    multi = mt.filter_rows(hl.len(mt.alleles) > 2)
    split = hl.split_multi_hts(multi)
    mt = split.union_rows(bi)
    return mt

mt = hl.read_matrix_table(mt_two)
mt = split_multi_allelic(mt)
mt.write(mt_three, overwrite=True)
ERROR:root:Exception while sending command.                  (17 + 292) / 84648]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1224, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1229, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
/tmp/ipykernel_240/4270762384.py in <module>
----> 1 mt.write(mt_three, overwrite=True)

<decorator-gen-1336> in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)

/opt/conda/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/opt/conda/lib/python3.7/site-packages/hail/matrixtable.py in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)
   2582 
   2583         writer = ir.MatrixNativeWriter(output, overwrite, stage_locally, _codec_spec, _partitions, _partitions_type, _checkpoint_file)
-> 2584         Env.backend().execute(ir.MatrixWrite(self._mir, writer))
   2585 
   2586     class _Show:

/opt/conda/lib/python3.7/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     97         # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
     98         try:
---> 99             result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
    100             (result, timings) = (result_tuple._1(), result_tuple._2())
    101             value = ir.typ._from_encoding(result)

/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1321         answer = self.gateway_client.send_command(command)
   1322         return_value = get_return_value(
-> 1323             answer, self.gateway_client, self.target_id, self.name)
   1324 
   1325         for temp_arg in temp_args:

/opt/conda/lib/python3.7/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     19         import pyspark
     20         try:
---> 21             return f(*args, **kwargs)
     22         except py4j.protocol.Py4JJavaError as e:
     23             s = e.java_exception.toString()

/opt/conda/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    334             raise Py4JError(
    335                 "An error occurred while calling {0}{1}{2}".
--> 336                 format(target_id, ".", name))
    337     else:
    338         type = answer[1]

Py4JError: An error occurred while calling o1.executeEncode

hail log: hail-20230802-1848-0.2.107-2387bb00ceee · GitHub

Thanks for the added information.

Mysterious. Assuming that really is the entire log file, it looks like your driver process just unceremoniously died. How much RAM does your driver have? When creating Dataproc clusters with hailctl, we default to an n1-highmem-16, which has a healthy amount of RAM, something like 100GiB.
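If it's not obvious from the cluster configuration, you can ask Spark directly from the notebook (just a sketch; which keys are set depends on how the cluster was configured):

import hail as hl

# Inspect the Spark settings the driver JVM is actually running with.
conf = hl.spark_context().getConf()
print(conf.get('spark.driver.memory', 'not set'))
print(conf.get('spark.driver.maxResultSize', 'not set'))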

These are the full environment specs. This is the default driver node configuration set by All of Us for Hail.

You probably don’t need 150GiB per worker disk. hailctl defaults to 40GB.

I assume the "CPUs" and "RAM" under the "Cloud compute profile" correspond to the driver node? 15 GB / 4 CPUs = 3.75 GB per CPU, so this sounds like an n1-standard-4. That might be too small for working with 250,000 samples, but I'll admit to not having a lot of good data on using Hail with that many samples. The gnomAD team are the real experts on that stuff. I'm pretty sure they use the default hailctl settings, so that'd be 16 CPUs and 120GB. Admittedly, that seems like a lot of RAM.

Yeah, that’s like 20KB per sample. Not nothing, but definitely pretty tight! I’m still surprised we don’t see any useful information in the hail logs. Is there a file with the name “hs_err_pid” or something like that in the working directory of your notebook/Python process?
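Something like this would turn them up if they exist (a sketch; the JVM usually writes crash logs to its working directory, but the location can vary):

import glob
import os

# Look for JVM fatal-error logs (hs_err_pid<pid>.log) in a few likely places.
for root in (os.getcwd(), os.path.expanduser('~'), '/tmp'):
    print(root, glob.glob(os.path.join(root, 'hs_err_pid*.log')))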

Thanks for all of your help.

Yes. The cloud compute profile section is the driver. Unfortunately, AoU does not allow the worker disk to be less than 150GB. I don't see an hs_err_pid file in the environment; I even tried looking for it by running find recursively from my home directory.

I was watching the environment with top as this was running, and memory usage by Java only reached about 25% right before the error.

I tried increasing the driver node to 16 CPUs and 104 GB of RAM, and it looks like that fixed the issue! Is there a resource you can provide to help us better understand how much RAM is required in the driver node? Or can you provide an estimate that we should use? Should we always use 104 GB? Is it possible to improve error messages and/or warnings from Hail in these scenarios?

-Andrew

We don't have any such resource other than the default values hard-coded into hailctl dataproc start. Yeah, I recommend always using 16 CPUs with ~8GiB of RAM per CPU for the driver VM of a Spark cluster used for Hail.