Error when running vds combiner

Hi

Can anyone give me any insight into this error?

I am running the Hail VDS combiner using the following code:

import hail as hl

# Combine the listed gVCFs into a single VDS, writing intermediates to tmp_dir
# and using the default exome intervals.
combiner = hl.vds.new_combiner(
    output_path=vds_file,
    temp_path=tmp_dir,
    gvcf_paths=gvcfs_to_load,
    use_exome_default_intervals=True,
)
combiner.run()

On a list of 50 gVCFs it runs fine, but with a list of 250 gVCFs I get an error saying a file does not exist:

hail.utils.java.FatalError: RemoteException: File does not exist: /lustre/scratch123/teams/hgi/re3/ibd_gvcf_mts/tmp/combiner-intermediates/d1a3a9db-d9b9-4163-82bc-862246272dfe_gvcf-combine_job1/dataset_0.vds/reference_data/index/part-04-4-4-0-e0246096-65a7-c7e4-21e1-be3eaca1f988.idx/metadata.json.gz

This error is confusing, as the file does exist. The gVCFs I am loading are all fine (I have used them for joint calling and not seen any errors).

I am running Hail version 0.2.105-acd89e80c345.

The output and beginning of the error are as follows:

LOGGING: writing to /home/ubuntu/hail_gvcf_combining/hail-20230127-1414-0.2.105-acd89e80c345.log
2023-01-27 14:14:25.932 Hail: INFO: scanning VCF for sortedness…
2023-01-27 14:14:34.711 Hail: INFO: Coerced sorted VCF - no additional import work to do
2023-01-27 14:14:41.117 Hail: WARN: generated combiner save path of file:///lustre/scratch123/teams/hgi/re3/ibd_gvcf_mts/tmp/combiner-plans/vds-combiner-plan_c293b9b56e98ee356c086d308bd8858af2f87537f51916ddb6eaf7c0a4d8fd3b_0.2.105.json
2023-01-27 14:14:41.488 Hail: INFO: Running VDS combiner:
VDS arguments: 0 datasets with 0 samples
GVCF arguments: 250 inputs/samples
Branch factor: 100
GVCF merge batch size: 50
2023-01-27 14:14:41.560 Hail: INFO: GVCF combine (job 1): merging 250 GVCFs into 3 datasets
2023-01-27 14:20:09.690 Hail: INFO: VDS Combine (job 2): merging 3 datasets with 250 samples
2023-01-27 14:20:14.532 Hail: INFO: wrote table with 47322672 rows in 65 partitions to file:///lustre/scratch123/teams/hgi/re3/ibd_gvcf_mts/tmp/combiner-intermediates/d1a3a9db-d9b9-4163-82bc-862246272dfe_vds-combine_job2/interval_checkpoint.ht
Traceback (most recent call last):
File "/home/ubuntu/hail_gvcf_combining/combine_gvcfs.py", line 64, in <module>
main()
File "/home/ubuntu/hail_gvcf_combining/combine_gvcfs.py", line 53, in main
load_gvcfs(gvcfs_to_load, mtdir, tmp_dir)
File "/home/ubuntu/hail_gvcf_combining/combine_gvcfs.py", line 17, in load_gvcfs
combiner.run()
File "/home/ubuntu/venv/lib/python3.8/site-packages/hail/vds/combiner/variant_dataset_combiner.py", line 344, in run
self.step()
File "/home/ubuntu/venv/lib/python3.8/site-packages/hail/vds/combiner/variant_dataset_combiner.py", line 401, in step
self._step_vdses()
File "/home/ubuntu/venv/lib/python3.8/site-packages/hail/vds/combiner/variant_dataset_combiner.py", line 446, in _step_vdses
combined.write(self._output_path)
File "/home/ubuntu/venv/lib/python3.8/site-packages/hail/vds/variant_dataset.py", line 125, in write
self.reference_data.write(VariantDataset._reference_path(path), **kwargs)
File "", line 2, in write
File "/home/ubuntu/venv/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
return original_func(*args, **kwargs)
File "/home/ubuntu/venv/lib/python3.8/site-packages/hail/matrixtable.py", line 2584, in write
Env.backend().execute(ir.MatrixWrite(self._mir, writer))
File "/home/ubuntu/venv/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 104, in execute
self._handle_fatal_error_from_backend(e, ir)
File "/home/ubuntu/venv/lib/python3.8/site-packages/hail/backend/backend.py", line 181, in _handle_fatal_error_from_backend
raise err
File "/home/ubuntu/venv/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 98, in execute
result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/home/ubuntu/venv/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 31, in deco
raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: RemoteException: File does not exist: /lustre/scratch123/teams/hgi/re3/ibd_gvcf_mts/tmp/combiner-intermediates/d1a3a9db-d9b9-4163-82bc-862246272dfe_gvcf-combine_job1/dataset_0.vds/reference_data/index/part-04-4-4-0-e0246096-65a7-c7e4-21e1-be3eaca1f988.idx/metadata.json.gz
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:156)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2071)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:773)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:458)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

That is indeed very confusing. It appears to me that HDFS is failing somehow.

Can you use a filesystem other than HDFS? In general, we find HDFS unreliable. In the cloud, we strongly recommend using GCS, S3, or Azure Blob Storage.
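If you do have access to object storage, a minimal sketch would point both the output and temporary paths directly at a bucket (the bucket name gs://my-bucket below is purely hypothetical; substitute your own GCS/S3/Azure location):

import hail as hl

# Hypothetical bucket paths; only gvcf_paths comes from your existing script.
combiner = hl.vds.new_combiner(
    output_path='gs://my-bucket/ibd/combined.vds',
    temp_path='gs://my-bucket/ibd/tmp',
    gvcf_paths=gvcfs_to_load,
    use_exome_default_intervals=True,
)
combiner.run()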

Another thing to try: specify all your paths with an explicit scheme, as file:///foo/bar/baz, to ensure that nothing uses HDFS. Unfortunately, Spark by default treats any path without a scheme as an HDFS path.
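For example, assuming vds_file, tmp_dir, and the entries of gvcfs_to_load are all absolute /lustre/... paths, a sketch like this keeps every path off HDFS:

# Prefix each absolute path with file:// so Spark never falls back to HDFS.
combiner = hl.vds.new_combiner(
    output_path='file://' + vds_file,
    temp_path='file://' + tmp_dir,
    gvcf_paths=['file://' + p for p in gvcfs_to_load],
    use_exome_default_intervals=True,
)
combiner.run()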