Error when running pc_relate: Exception while sending command & Answer from Java side is empty

Hello,

I am getting the following error when I run pc_relate on 4k individuals with 100k markers:

ERROR:root:Exception while sending command.                         (0 + 1) / 4]
Traceback (most recent call last):
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<decorator-gen-1708>", line 2, in pc_relate
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/hail/methods/relatedness/pc_relate.py", line 344, in pc_relate
    ht = Table(ir.BlockMatrixToTableApply(g._bmir, pcs._ir, {
  File "<decorator-gen-1110>", line 2, in persist
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/hail/table.py", line 2127, in persist
    return Env.backend().persist_table(self)
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/hail/backend/backend.py", line 163, in persist_table
    return t.checkpoint(tf.__enter__())
  File "<decorator-gen-1100>", line 2, in checkpoint
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/hail/table.py", line 1346, in checkpoint
    self.write(output=output, overwrite=overwrite, stage_locally=stage_locally, _codec_spec=_codec_spec)
  File "<decorator-gen-1102>", line 2, in write
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/hail/table.py", line 1392, in write
    Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/hail/backend/py4j_backend.py", line 76, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/hail/backend/py4j_backend.py", line 25, in deco
    return f(*args, **kwargs)
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/py4j/protocol.py", line 334, in get_return_value
    raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling o1.executeEncode

I am using Java 8 and Spark 3.3.2. My code is as follows:

pc_rel = hl.pc_relate(d.GT, k=10, min_individual_maf=0.05, statistics='kin') 

Any assistance is greatly appreciated!

Cheers,

Angus

Hi @acburns! I’m sorry to hear you’re having trouble. This looks like the JVM crashed. Are you using a cluster, a laptop, a server, or something else? How much memory does that have?

If you’re on a laptop or server, are you telling Hail how much RAM to use?
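If not, one way to do that when starting an IPython session is something like this (the 8g is just a placeholder; scale it to the memory you actually have):

# placeholder sizes; raise these to match your machine or allocation
PYSPARK_SUBMIT_ARGS="--driver-memory 8g --executor-memory 8g pyspark-shell" ipython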

Hi @danking - I am using the Broad server with ‘ish -l h_vmem=50G’. I used the command you recommended above to start an ipython shell with 50G of RAM (changed the 8 to 50) and noticed that the pc_relate function ran faster, but it still crashed at the same point (Stage 88) with a similar error message (it seems like the last few lines are different; see below).

Also, I tried this other method mentioned in the thread and got the same result:

hl.init(spark_conf={'spark.driver.memory': '50g'},tmp_dir='/fg/saxenalab/data/GnomAD/GnomAD_HGDP-1KG/PC_Relate/HG38', local_tmpdir='/fg/saxenalab/data/GnomAD/GnomAD_HGDP-1KG/PC_Relate/HG38')

I’ll include the log file here as well - perhaps that could help diagnose this issue? It’s very hard to parse on my end.
hail-20230422-2108-0.2.113-cf32652c5077.log (4.9 MB)

Error message:

[Stage 88:>                                                         (0 + 1) / 4]

ERROR:root:Exception while sending command.                         (0 + 1) / 4]
Traceback (most recent call last):
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<ipython-input-4-83c7d140caf9> in <module>
----> 1 pc_rel = hl.pc_relate(d.GT, k=10, min_individual_maf=0.05, statistics='kin')

<decorator-gen-1816> in pc_relate(call_expr, min_individual_maf, k, scores_expr, min_kinship, statistics, block_size, include_self_kinship)

~/.local/lib/python3.9/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    582     def wrapper(__original_func, *args, **kwargs):
    583         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 584         return __original_func(*args_, **kwargs_)
    585
    586     return wrapper

~/.local/lib/python3.9/site-packages/hail/methods/relatedness/pc_relate.py in pc_relate(call_expr, min_individual_maf, k, scores_expr, min_kinship, statistics, block_size, include_self_kinship)
    342     pcs = scores_table.collect(_localize=False).map(lambda x: x.__scores)
    343
--> 344     ht = Table(ir.BlockMatrixToTableApply(g._bmir, pcs._ir, {
    345         'name': 'PCRelate',
    346         'maf': min_individual_maf,

<decorator-gen-1218> in persist(self, storage_level)

~/.local/lib/python3.9/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    582     def wrapper(__original_func, *args, **kwargs):
    583         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 584         return __original_func(*args_, **kwargs_)
    585
    586     return wrapper

~/.local/lib/python3.9/site-packages/hail/table.py in persist(self, storage_level)
   2125             Persisted table.
   2126         """
-> 2127         return Env.backend().persist_table(self)
   2128
   2129     def unpersist(self) -> 'Table':

~/.local/lib/python3.9/site-packages/hail/backend/backend.py in persist_table(self, t)
    161         tf = TemporaryFilename(prefix='persist_table')
    162         self._persisted_locations[t] = tf
--> 163         return t.checkpoint(tf.__enter__())
    164
    165     def unpersist_table(self, t):

<decorator-gen-1208> in checkpoint(self, output, overwrite, stage_locally, _codec_spec, _read_if_exists, _intervals, _filter_intervals)

~/.local/lib/python3.9/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    582     def wrapper(__original_func, *args, **kwargs):
    583         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 584         return __original_func(*args_, **kwargs_)
    585
    586     return wrapper

~/.local/lib/python3.9/site-packages/hail/table.py in checkpoint(self, output, overwrite, stage_locally, _codec_spec, _read_if_exists, _intervals, _filter_intervals)
   1344
   1345         if not _read_if_exists or not hl.hadoop_exists(f'{output}/_SUCCESS'):
-> 1346             self.write(output=output, overwrite=overwrite, stage_locally=stage_locally, _codec_spec=_codec_spec)
   1347             _assert_type = self._type
   1348             _load_refs = False

<decorator-gen-1210> in write(self, output, overwrite, stage_locally, _codec_spec)

~/.local/lib/python3.9/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    582     def wrapper(__original_func, *args, **kwargs):
    583         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 584         return __original_func(*args_, **kwargs_)
    585
    586     return wrapper

~/.local/lib/python3.9/site-packages/hail/table.py in write(self, output, overwrite, stage_locally, _codec_spec)
   1390         hl.current_backend().validate_file_scheme(output)
   1391
-> 1392         Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
   1393
   1394     @typecheck_method(output=str,

~/.local/lib/python3.9/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     74         # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
     75         try:
---> 76             result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
     77             (result, timings) = (result_tuple._1(), result_tuple._2())
     78             value = ir.typ._from_encoding(result)

~/.local/lib/python3.9/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1319
   1320         answer = self.gateway_client.send_command(command)
-> 1321         return_value = get_return_value(
   1322             answer, self.gateway_client, self.target_id, self.name)
   1323

~/.local/lib/python3.9/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     23         import pyspark
     24         try:
---> 25             return f(*args, **kwargs)
     26         except py4j.protocol.Py4JJavaError as e:
     27             s = e.java_exception.toString()

~/.local/lib/python3.9/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    332                     format(target_id, ".", name, value))
    333         else:
--> 334             raise Py4JError(
    335                 "An error occurred while calling {0}{1}{2}".
    336                 format(target_id, ".", name))

Py4JError: An error occurred while calling o1.executeEncode

In [5]: ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/home/unix/aburns/.local/lib/python3.9/site-packages/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving

Just to be clear, did you specify both the Spark driver memory argument and the Spark executor memory argument? The log indicates that your worker processes are dying. That usually means the workers have insufficient RAM.
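For reference, setting both through hl.init would look something like this (the sizes are just placeholders):

# placeholder sizes; set both driver and executor memory explicitly
hl.init(spark_conf={'spark.driver.memory': '50g',
                    'spark.executor.memory': '50g'})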

It also appears that you’re using quite a lot of partitions, which can make the PCA portion of pc_relate take much longer than necessary. Can you share the full script you’re running?

Hi Dan,

Yep, I increased the memory for both the driver and the executor.

When you say I am using a lot of partitions, are you referring to the block_size argument? Do you recommend reducing this to a specific number?

I have attached the example sequence of code I am using.

Regards,

Angus

Example_PC_relate_script.txt (961 Bytes)

You should write out to Hail MatrixTable format before running pc_relate. Generally, Hail will be faster and less memory-hungry when reading directly from its native format.
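For example, something like this (the path is a placeholder):

# write the imported dataset to Hail's native MatrixTable format once,
# then read from that file for all downstream work
d.write('/path/to/dataset.mt', overwrite=True)
d = hl.read_matrix_table('/path/to/dataset.mt')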

I’m referring to the partitioning of a dataset. Every dataset in Hail is partitioned after import/read. You can change the partitioning (note: this can be time-consuming, as it often means moving data around). You generally want ~128 MiB per partition. PCA is particularly poorly behaved (unfortunately we don’t control it; it’s implemented in Apache Spark, an underlying library), and it prefers just tens of partitions. See the pc_relate examples for an example of doing PCA separately from pc_relate. The call to PCA can be:

dataset_for_pca = dataset.naive_coalesce(100)
_, scores_table, _ = hl.hwe_normalized_pca(dataset_for_pca.GT,
                                           k=10,
                                           compute_loadings=False)

This assumes that your dataset has many more than 100 partitions and that going to 100 is a substantial improvement. Check the number of partitions with n_partitions().
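For example:

# check how the dataset is currently partitioned before deciding to coalesce
print(dataset.n_partitions())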

Make sure you write the results as a Hail Table as well. Generally, exporting directly to text is a bad idea in Hail. It uses more memory, is slower, and is less reliable than writing to a Hail native format and then reading and exporting afterwards. That isolates the export from the most speed-sensitive portion of the pipeline.
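For example, something like this (paths are placeholders):

# write the pc_relate results in Hail's native Table format first...
pc_rel.write('/path/to/pc_relate_results.ht', overwrite=True)
# ...then read them back and export to text as a separate, cheap step
pc_rel = hl.read_table('/path/to/pc_relate_results.ht')
pc_rel.export('/path/to/pc_relate_results.tsv.bgz')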

Thanks for your help Dan, I ended up getting this to work by setting

block_size=1024
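That is, the same call as before with a larger block size, roughly:

# same pc_relate call as before, with block_size raised to 1024
pc_rel = hl.pc_relate(d.GT, k=10, min_individual_maf=0.05, statistics='kin', block_size=1024)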

I did notice that this pc_relate method and Hail's KING method took quite a long time to run compared to a KING run in PLINK.

That’s expected. PC-Relate is a fair bit more complicated than KING.

PLINK uses a representation of genotypes that’s at least 16 times smaller, and possibly much smaller, than Hail’s generic representation. PLINK also uses vector intrinsics that Hail, sadly, cannot yet use due to our unfortunate choice of the JVM.

If you need the fastest possible implementation of KING you should absolutely use PLINK.