Import_vcf, old to new hail switch: py4j.protocol.Py4JNetworkError: Answer from Java side is empty

I know there is another thread about this error, but it didn’t help me solve the issue.

With Hail 0.2.57 everything works perfectly for me, but when I tried the latest version, 0.2.70, I got the error below. Here is the full stack trace:

Initializing Hail with default parameters...
Exception in thread "Thread-6" java.lang.NoClassDefFoundError: scala/Product$class
	at is.hail.relocated.org.json4s.NoTypeHints$.<init>(Formats.scala:429)
	at is.hail.relocated.org.json4s.NoTypeHints$.<clinit>(Formats.scala)
	at is.hail.utils.package$.<init>(package.scala:472)
	at is.hail.utils.package$.<clinit>(package.scala)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at py4j.reflection.CurrentThreadClassLoadingStrategy.classForName(CurrentThreadClassLoadingStrategy.java:40)
	at py4j.reflection.ReflectionUtil.classForName(ReflectionUtil.java:51)
	at py4j.reflection.TypeUtil.forName(TypeUtil.java:243)
	at py4j.commands.ReflectionCommand.getUnknownMember(ReflectionCommand.java:175)
	at py4j.commands.ReflectionCommand.execute(ReflectionCommand.java:87)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)INFO:py4j.java_gateway:Error while receiving.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1207, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 13 more
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1207, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1033, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1212, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR: [pid 28790] Worker Worker(salt=465004285, workers=1, host=ip-172-21-61-13, username=hadoop, pid=28790) failed    SeqrVCFToMTTask(source_paths=["s3://seqr-dp-data--prod/vcf/batch109_subset.vcf"], dest_path=s3://seqr-dp-build--qa/mt-hail-luigi/test/batch109_subset.mt, genome_version=38, array_elements_required=False, vep_runner=VEP, reference_ht_path=s3://combined_reference_data_grch38.ht, clinvar_ht_path=s3://clinvar.GRCh38.ht, hgmd_like_csv_path=s3://GRCh38_HGMD_2020_03_v2.csv, hgmd_ht_path=s3://hgmd_hg38.ht, cidr_ht_path=None, nisc_ht_path=s3://NISC.ht, bgi_ht_path=s3://BGI.ht, hgsc_wes_ht_path=None, hgsc_wgs_ht_path=s3://HGSC_WGS.ht, sample_type=WES, validate=False, dataset_type=VARIANTS, remap_path=, subset_path=)
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/luigi/worker.py", line 199, in run
    new_deps = self._run_get_new_deps()
  File "/usr/local/lib/python3.7/site-packages/luigi/worker.py", line 141, in _run_get_new_deps
    task_gen = self.task.run()
  File "/home/hadoop/hail-elasticsearch-pipelines/luigi_pipeline/seqr_loading.py", line 51, in run
    self.read_vcf_write_mt()
  File "/home/hadoop/hail-elasticsearch-pipelines/luigi_pipeline/seqr_loading.py", line 54, in read_vcf_write_mt
    mt = self.import_vcf()
  File "/home/hadoop/hail-elasticsearch-pipelines/luigi_pipeline/lib/hail_tasks.py", line 105, in import_vcf
    force_bgz=True, min_partitions=500, array_elements_required=self.array_elements_required)
  File "<decorator-gen-1316>", line 2, in import_vcf
  File "/home/hadoop/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 576, in wrapper
    args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
  File "/home/hadoop/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 543, in check_all
    args_.append(arg_check(args[i], name, arg_name, checker))
  File "/home/hadoop/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 584, in arg_check
    return checker.check(arg, function_name, arg_name)
  File "/home/hadoop/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 82, in check
    return tc.check(x, caller, param)
  File "/home/hadoop/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 328, in check
    return f(tc.check(x, caller, param))
  File "/home/hadoop/.local/lib/python3.7/site-packages/hail/genetics/reference_genome.py", line 10, in <lambda>
    reference_genome_type = oneof(transformed((str, lambda x: hl.get_reference(x))), rg_type)
  File "/home/hadoop/.local/lib/python3.7/site-packages/hail/context.py", line 554, in get_reference
    Env.hc()
  File "/home/hadoop/.local/lib/python3.7/site-packages/hail/utils/java.py", line 55, in hc
    init()
  File "<decorator-gen-1658>", line 2, in init
  File "/home/hadoop/.local/lib/python3.7/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/home/hadoop/.local/lib/python3.7/site-packages/hail/context.py", line 252, in init
    skip_logging_configuration, optimizer_iterations)
  File "/home/hadoop/.local/lib/python3.7/site-packages/hail/backend/spark_backend.py", line 163, in __init__
    self._utils_package_object = scala_package_object(hail_package.utils)
  File "/home/hadoop/.local/lib/python3.7/site-packages/hail/utils/java.py", line 122, in scala_package_object
    return scala_object(jpackage, 'package')
  File "/home/hadoop/.local/lib/python3.7/site-packages/hail/utils/java.py", line 118, in scala_object
    return getattr(getattr(jpackage, name + '$'), 'MODULE$')
  File "/usr/local/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1644, in __getattr__
    raise Py4JError("{0} does not exist in the JVM".format(new_fqn))
py4j.protocol.Py4JError: is.hail.utils.package$ does not exist in the JVM

It fails when import_vcf is run:

hl.import_vcf([vcf_file for vcf_file in self.source_paths],
              reference_genome='GRCh' + self.genome_version,
              force_bgz=True, min_partitions=500,
              array_elements_required=self.array_elements_required)

What version of Spark are you using? I think we switched the PyPI artifacts from Spark 2 to Spark 3 somewhere between 0.2.57 and 0.2.70.
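One quick way to see which versions are actually installed in the Python environment that runs the pipeline is a small script like the one below. This is only a sketch: the helper name installed_version is made up for illustration, and it reports what pip has installed, which may differ from the Spark that the cluster itself ships.

```python
try:
    from importlib import metadata as importlib_metadata  # Python 3.8+
except ImportError:  # Python 3.7, as in the traceback above
    importlib_metadata = None


def installed_version(dist_name):
    """Return the installed version of a pip distribution, or None if absent.

    Hypothetical helper for illustration; not part of Hail or pyspark.
    """
    if importlib_metadata is not None:
        try:
            return importlib_metadata.version(dist_name)
        except importlib_metadata.PackageNotFoundError:
            return None
    import pkg_resources  # setuptools fallback for older Pythons
    try:
        return pkg_resources.get_distribution(dist_name).version
    except pkg_resources.DistributionNotFound:
        return None


if __name__ == "__main__":
    # Print the pip-installed versions of the two packages in question.
    for name in ("hail", "pyspark"):
        print(name, installed_version(name))
```

If pyspark reports a 2.x version next to a recent hail, that mismatch would be the first thing to rule out.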

OK, we realized that we were using Spark 2. After updating it, that issue was resolved, but we hit a different one:

  File "/usr/local/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/hadoop/.local/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 32, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
hail.utils.java.FatalError: UnsupportedFileSystemException: No FileSystem for scheme "s3"
Java stack trace:
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"

When we checked whether Hail supports the s3 scheme, it returned False:


[hadoop@ip-4234 ~]$ python3
Python 3.7.10 (default, Jun  3 2021, 00:02:01) 
[GCC 7.3.1 20180712 (Red Hat 7.3.1-13)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> import hail
>>> 
>>> hail.utils.hadoop_scheme_supported('s3')
Initializing Hail with default parameters...
2021-06-24 20:10:39 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.1.2
SparkUI available at http://ip-34453.ec2.internal:2323
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.70-5bb98953a4a7
LOGGING: writing to /home/hadoop/hail-20210624-2010-0.2.70-5bb98953a4a7.log
False
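The same check can be run against every path a task uses before calling import_vcf, so the job fails early with a clear message. The sketch below assumes nothing about the pipeline beyond its path strings; unsupported_schemes and check_with_hail are hypothetical helper names, and the only Hail call used is hail.utils.hadoop_scheme_supported, as in the session above.

```python
from urllib.parse import urlparse


def unsupported_schemes(paths, is_supported):
    """Return the sorted URL schemes in `paths` for which `is_supported` is False.

    Paths without a scheme (plain local paths) are ignored.
    """
    schemes = {urlparse(p).scheme for p in paths}
    schemes.discard("")
    return sorted(s for s in schemes if not is_supported(s))


def check_with_hail(paths):
    """Run the check using Hail's own scheme test (initializes Hail if needed)."""
    import hail as hl  # imported lazily so the helper above works without Hail

    return unsupported_schemes(paths, hl.utils.hadoop_scheme_supported)
```

Calling check_with_hail with the task's source and destination paths would return ['s3'] in the situation shown above and an empty list once the scheme is supported.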

What’s your runtime? Are you running locally and reading from S3?

It was working just fine with 0.2.57, so I’m not sure it’s an access issue. We’re on AWS EMR (emr-6.3.0), which accesses S3. We can also log in to the cluster and work on it locally.

How did you update Spark? Did you use a different EMR image/version? It looks like you have pyspark installed with pip:

  File "/usr/local/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__

You might try sshing into the driver node and running pip3 uninstall pyspark -y to see if that fixes things by falling back to the EMR Spark installation.
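To confirm which pyspark Python would pick up before and after uninstalling, a rough check like the following may help. It is a heuristic sketch only: both function names are made up for illustration, and the /usr/lib/spark location mentioned in the comment is an assumption about where EMR keeps its own Spark.

```python
import importlib.util
from pathlib import PurePosixPath


def looks_pip_installed(module_file):
    """Heuristic: True if the path passes through site-packages or dist-packages."""
    parts = PurePosixPath(module_file).parts
    return "site-packages" in parts or "dist-packages" in parts


def pyspark_location():
    """Return the file pyspark would be imported from, or None if not importable."""
    spec = importlib.util.find_spec("pyspark")
    return spec.origin if spec is not None else None


# The path from the traceback above goes through site-packages, so it is
# classified as a pip install; a path under /usr/lib/spark (assumed EMR
# location) would not be.
```

On the driver node, printing pyspark_location() and passing a non-None result to looks_pip_installed shows whether the pip copy is still shadowing the cluster's Spark.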