Zip: length mismatch error

Hi,

If I run the code in the attached file
am_test.txt (1.3 KB)
I get the following error:

> 2023-08-09 15:56:05.872 Hail: INFO: wrote table with 104 rows in 104 partitions to gs://gnomad-tmp-4day/__iruid_845-wg8w0gBbPsilRDfRegbVW3
INFO (constraint_pipeline 463): Copying log to logging bucket...
2023-08-09 15:56:33.276 Hail: INFO: copying log to 'gs://gnomad-tmp/gnomad_v2.1.1_testing/constraint/logging/constraint_pipeline.log'...
Traceback (most recent call last):
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/constraint_pipeline.py", line 785, in <module>
    main(args)
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/constraint_pipeline.py", line 412, in main
    oe_ht = apply_models(
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/pyscripts_ge6ozu1m.zip/gnomad_constraint/utils/constraint.py", line 396, in apply_models
  File "/tmp/17e6dae80e9548a9a94895fd07e6dee0/pyscripts_ge6ozu1m.zip/gnomad_constraint/utils/constraint.py", line 279, in create_observed_and_possible_ht
  File "<decorator-gen-1118>", line 2, in checkpoint
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1331, in checkpoint
    self.write(output=output, overwrite=overwrite, stage_locally=stage_locally, _codec_spec=_codec_spec)
  File "<decorator-gen-1120>", line 2, in write
  File "/opt/conda/default/lib/python3.10/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.10/site-packages/hail/table.py", line 1377, in write
    Env.backend().execute(ir.TableWrite(self._tir, ir.TableNativeWriter(output, overwrite, stage_locally, _codec_spec)))
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 82, in execute
    raise e.maybe_user_error(ir) from None
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 76, in execute
    result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/opt/conda/default/lib/python3.10/site-packages/hail/backend/py4j_backend.py", line 35, in deco
    raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
hail.utils.java.FatalError: HailException: zip: length mismatch: 62164, 104

Java stack trace:
is.hail.utils.HailException: zip: length mismatch: 62164, 104
	at __C8160Compiled.__m8201split_ToArray(Emit.scala)
	at __C8160Compiled.__m8169split_CollectDistributedArray(Emit.scala)
	at __C8160Compiled.__m8164split_Let(Emit.scala)
	at __C8160Compiled.apply(Emit.scala)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$4(CompileAndEvaluate.scala:61)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$2(CompileAndEvaluate.scala:61)
	at is.hail.expr.ir.CompileAndEvaluate$.$anonfun$_apply$2$adapted(CompileAndEvaluate.scala:59)
	at is.hail.backend.ExecuteContext.$anonfun$scopedExecution$1(ExecuteContext.scala:140)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.ExecuteContext.scopedExecution(ExecuteContext.scala:140)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:59)
	at is.hail.expr.ir.CompileAndEvaluate$.evalToIR(CompileAndEvaluate.scala:33)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:30)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:58)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:63)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:67)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:62)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:22)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:20)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:20)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:50)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:462)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:498)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:75)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:75)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:63)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:350)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:495)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:494)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)



Hail version: 0.2.120-f00f916faf78
Error summary: HailException: zip: length mismatch: 62164, 104

One of the input tables (`context_ht`) has 8061724269 rows and 62164 partitions, and another input table (`mutation_ht`) has 104 rows and 104 partitions. If I add `mutation_ht = mutation_ht.repartition(1)` at line 37, the script runs fine. If I instead add `mutation_ht = mutation_ht.naive_coalesce(1)`, I get the same error.
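For context, here's a minimal sketch of the workaround; the read path is a placeholder, not the real one from am_test.txt:

```python
import hail as hl

# Placeholder path -- the real mutation_ht comes from the attached am_test.txt script.
mutation_ht = hl.read_table("gs://my-bucket/mutation_rate.ht")  # 104 rows, 104 partitions

# This makes the pipeline run: a full shuffle into a single partition.
mutation_ht = mutation_ht.repartition(1)

# This variant still fails with the same zip error:
# mutation_ht = mutation_ht.naive_coalesce(1)
```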

Why does the original partitioning of the input tables result in an error?
Why does `repartition` make it work but `naive_coalesce` doesn’t?
How do you typically choose whether to use `repartition` vs `naive_coalesce`?
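For what it’s worth, here’s a toy comparison of the two calls (not my real data); both end up with a single partition, so I would have expected them to behave the same downstream:

```python
import hail as hl

ht = hl.utils.range_table(1000, n_partitions=100)

# repartition: shuffles the rows into exactly the requested number of partitions.
print(ht.repartition(1).n_partitions())  # 1

# naive_coalesce: merges adjacent partitions without a shuffle (cheaper).
print(ht.naive_coalesce(1).n_partitions())  # 1
```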

Thank you!


I was able to download the log file and will be investigating more now.

@klaricch Sorry to bother you, but the most recent log is one from a successful run. Do you have a log from a failed one?

I’ll rerun it quickly and send an example of a failed one.

I’m getting a similar length-mismatch error while running RMC, specifically this script. My cluster config is below.

    --master-machine-type=n1-standard-8 \
    --master-boot-disk-size=100GB \
    --num-master-local-ssds=0 \
    --num-secondary-workers=4 \
    --num-worker-local-ssds=0 \
    --num-workers=2 \
    --secondary-worker-boot-disk-size=40GB \
    --worker-boot-disk-size=40GB \
    --worker-machine-type=n1-standard-8 \
    --initialization-action-timeout=20m \
    --network=panchal-sandbox-network \
    --project=panchal-sandbox-b3b6 \
    --labels=creator=rpanchal_broadinstitute_org \
    --max-idle=15m \
    --tags=dataproc-node,ssh-broad \
    --autoscaling-policy=max-20
The specific command is:

rmc/pipeline/two_breaks/run_batches_dataproc.py \
    --run-sections-over-threshold \
    --search-num 2 \
    --freeze 7 \
    --group-size 50

The Hail log is attached:
round2search_for_two_breaks_run_batches_dataproc.log (5.1 MB)

Possibly this line is causing it? I’ll see if removing the `naive_coalesce` call fixes it like it did for @klaricch.
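If it helps anyone else hitting this, here’s a sketch of the swap I’ll try; the table here is a stand-in for whatever the linked RMC line actually coalesces:

```python
import hail as hl

ht = hl.utils.range_table(100, n_partitions=10)  # stand-in for the real RMC table

# Suspected trigger (the line linked above):
# ht = ht.naive_coalesce(1)

# Swap to try, since repartition worked for @klaricch above:
ht = ht.repartition(1)
```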