Missing part in matrix table?

I have some data in a matrix table stored in S3 in the us-west region. I’d like to merge this data with the 1000 Genomes high-coverage autosomes dataset. Anticipating (correctly) that this would not be a straightforward, one-time operation, I made a us-west copy of all the objects that make up the matrix table at s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt.

My attempt to merge these datasets looks more or less like this (I’m using version 0.2.72-cfce5e858cab):

my_mt = hl.read_matrix_table('s3a://my_bucket/my_cohort.mt')
tgp_mt = hl.read_matrix_table('s3a://my_bucket/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt')
tgp_mt = tgp_mt.select_cols()
my_tgp_ekeys = my_mt.entry.keys() & tgp_mt.entry.keys()
my_mt = my_mt.select_entries(*my_tgp_ekeys)
tgp_mt = tgp_mt.select_entries(*my_tgp_ekeys)
my_tgp_mt = tgp_mt.union_cols(my_mt)
my_tgp_mt = hl.sample_qc(my_tgp_mt,name="sample_qc")
my_tgp_mt.write('s3a://my_bucket/my_tgp_cohort.mt')

What ends up happening is that I get a FileNotFoundException:
No such file or directory: s3a://my_bucket/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620

If I run aws s3 ls --no-sign-request s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/ , I can confirm that part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620 is not there. However, when I look at the metadata.json.gz's _partFile entries, I do find the missing part listed. It appears that there are both S3 objects not listed in metadata.json.gz and entries in metadata.json.gz that do not correspond to any object in S3.
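For anyone debugging a similar mismatch, a small helper like the following can report both directions of the discrepancy at once. This is a hypothetical sketch: `compare_parts` and the toy part names are my own, and it assumes you have already extracted the expected part filenames from metadata.json.gz (it is a gzipped JSON file) and the actual part filenames from an S3 listing into two Python lists.

```python
def compare_parts(metadata_parts, s3_parts):
    """Compare the part files a matrix table's metadata expects
    against the part files actually present in object storage.

    Returns a pair of sorted lists:
    (expected parts missing from storage,
     stored parts not listed in the metadata)."""
    expected = set(metadata_parts)
    actual = set(s3_parts)
    return sorted(expected - actual), sorted(actual - expected)

# Toy example; the real lists would come from metadata.json.gz
# and from listing the parts/ prefix with `aws s3 ls`:
missing, unlisted = compare_parts(
    ["part-00001", "part-00002", "part-00311"],
    ["part-00001", "part-00002", "part-99999"],
)
# missing == ["part-00311"]; unlisted == ["part-99999"]
```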

I guess my question is: what’s going on here? Is this the right way to merge two cohorts stored as two matrix tables? Are there actually parts missing from the GRCh38 30x 1000 Genomes matrix table?

The full backtrace is:

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 947 in stage 3.0 failed 10 times, most recent failure: Lost task 947.9 in stage 3.0 (TID 2149) (172.18.6.11 executor 27): java.io.FileNotFoundException: No such file or directory: s3a://my_bucket/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2269)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2163)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2102)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:702)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:899)
	at is.hail.io.fs.HadoopFS.openNoCompression(HadoopFS.scala:83)
	at is.hail.io.fs.FS.open(FS.scala:139)
	at is.hail.io.fs.FS.open$(FS.scala:138)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:70)
	at is.hail.io.fs.FS.open(FS.scala:151)
	at is.hail.io.fs.FS.open$(FS.scala:150)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:70)
	at is.hail.io.fs.FS.open(FS.scala:148)
	at is.hail.io.fs.FS.open$(FS.scala:147)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:70)
	at is.hail.HailContext$.$anonfun$readRowsSplit$5(HailContext.scala:383)
	at is.hail.sparkextras.IndexReadRDD.compute(IndexReadRDD.scala:25)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at is.hail.sparkextras.ContextRDD.iterator(ContextRDD.scala:390)
	at is.hail.sparkextras.RepartitionedOrderedRDD2$$anon$1.$anonfun$parentIterator$1(RepartitionedOrderedRDD2.scala:66)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at is.hail.sparkextras.RepartitionedOrderedRDD2$$anon$1.dropLeft(RepartitionedOrderedRDD2.scala:76)
	at is.hail.sparkextras.RepartitionedOrderedRDD2$$anon$1.<init>(RepartitionedOrderedRDD2.scala:73)
	at is.hail.sparkextras.RepartitionedOrderedRDD2.$anonfun$compute$1(RepartitionedOrderedRDD2.scala:62)
	at is.hail.io.RichContextRDDLong$.$anonfun$boundary$4(RichContextRDDRegionValue.scala:188)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at is.hail.io.RichContextRDDLong$$anon$3.hasNext(RichContextRDDRegionValue.scala:197)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
	at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:1087)
	at is.hail.utils.richUtils.RichIterator$$anon$1.isValid(RichIterator.scala:30)
	at is.hail.utils.StagingIterator.isValid(FlipbookIterator.scala:48)
	at is.hail.utils.FlipbookIterator$$anon$6.calculateValidity(FlipbookIterator.scala:221)
	at is.hail.utils.FlipbookIterator$ValidityCachingStateMachine.refreshValidity(FlipbookIterator.scala:210)
	at is.hail.utils.FlipbookIterator$ValidityCachingStateMachine.refreshValidity$(FlipbookIterator.scala:209)
	at is.hail.utils.FlipbookIterator$$anon$6.refreshValidity(FlipbookIterator.scala:219)
	at is.hail.utils.FlipbookIterator$ValidityCachingStateMachine.$init$(FlipbookIterator.scala:214)
	at is.hail.utils.FlipbookIterator$$anon$6.<init>(FlipbookIterator.scala:219)
	at is.hail.utils.FlipbookIterator.staircased(FlipbookIterator.scala:219)
	at is.hail.utils.FlipbookIterator.cogroup(FlipbookIterator.scala:254)
	at is.hail.utils.FlipbookIterator.innerJoin(FlipbookIterator.scala:360)
	at is.hail.annotations.OrderedRVIterator.innerJoin(OrderedRVIterator.scala:116)
	at is.hail.rvd.KeyedRVD.$anonfun$orderedJoin$1(KeyedRVD.scala:66)
	at is.hail.rvd.KeyedRVD.$anonfun$orderedJoin$5(KeyedRVD.scala:86)
	at is.hail.sparkextras.ContextRDD.$anonfun$czipPartitions$2(ContextRDD.scala:316)
	at is.hail.sparkextras.ContextRDD.$anonfun$cmapPartitions$3(ContextRDD.scala:218)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
	at is.hail.utils.richUtils.RichContextRDD$$anon$1.hasNext(RichContextRDD.scala:71)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2291)
	at is.hail.rvd.RVD.combine(RVD.scala:725)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:913)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:56)
	at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
	at is.hail.expr.ir.InterpretNonCompilable$.rewrite$1(InterpretNonCompilable.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.rewrite$1(InterpretNonCompilable.scala:39)
	at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:12)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:14)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:12)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:29)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:381)
	at is.hail.backend.spark.SparkBackend.$anonfun$execute$1(SparkBackend.scala:365)
	at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:627)
	at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:627)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:46)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:362)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeJSON$1(SparkBackend.scala:406)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:404)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

java.io.FileNotFoundException: No such file or directory: s3a://my_bucket/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620
	at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2269)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2163)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2102)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:702)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:899)
	at is.hail.io.fs.HadoopFS.openNoCompression(HadoopFS.scala:83)
	at is.hail.io.fs.FS.open(FS.scala:139)
	at is.hail.io.fs.FS.open$(FS.scala:138)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:70)
	at is.hail.io.fs.FS.open(FS.scala:151)
	at is.hail.io.fs.FS.open$(FS.scala:150)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:70)
	at is.hail.io.fs.FS.open(FS.scala:148)
	at is.hail.io.fs.FS.open$(FS.scala:147)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:70)
	at is.hail.HailContext$.$anonfun$readRowsSplit$5(HailContext.scala:383)
	at is.hail.sparkextras.IndexReadRDD.compute(IndexReadRDD.scala:25)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at is.hail.sparkextras.ContextRDD.iterator(ContextRDD.scala:390)
	at is.hail.sparkextras.RepartitionedOrderedRDD2$$anon$1.$anonfun$parentIterator$1(RepartitionedOrderedRDD2.scala:66)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at is.hail.sparkextras.RepartitionedOrderedRDD2$$anon$1.dropLeft(RepartitionedOrderedRDD2.scala:76)
	at is.hail.sparkextras.RepartitionedOrderedRDD2$$anon$1.<init>(RepartitionedOrderedRDD2.scala:73)
	at is.hail.sparkextras.RepartitionedOrderedRDD2.$anonfun$compute$1(RepartitionedOrderedRDD2.scala:62)
	at is.hail.io.RichContextRDDLong$.$anonfun$boundary$4(RichContextRDDRegionValue.scala:188)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at is.hail.io.RichContextRDDLong$$anon$3.hasNext(RichContextRDDRegionValue.scala:197)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
	at scala.collection.Iterator$$anon$22.hasNext(Iterator.scala:1087)
	at is.hail.utils.richUtils.RichIterator$$anon$1.isValid(RichIterator.scala:30)
	at is.hail.utils.StagingIterator.isValid(FlipbookIterator.scala:48)
	at is.hail.utils.FlipbookIterator$$anon$6.calculateValidity(FlipbookIterator.scala:221)
	at is.hail.utils.FlipbookIterator$ValidityCachingStateMachine.refreshValidity(FlipbookIterator.scala:210)
	at is.hail.utils.FlipbookIterator$ValidityCachingStateMachine.refreshValidity$(FlipbookIterator.scala:209)
	at is.hail.utils.FlipbookIterator$$anon$6.refreshValidity(FlipbookIterator.scala:219)
	at is.hail.utils.FlipbookIterator$ValidityCachingStateMachine.$init$(FlipbookIterator.scala:214)
	at is.hail.utils.FlipbookIterator$$anon$6.<init>(FlipbookIterator.scala:219)
	at is.hail.utils.FlipbookIterator.staircased(FlipbookIterator.scala:219)
	at is.hail.utils.FlipbookIterator.cogroup(FlipbookIterator.scala:254)
	at is.hail.utils.FlipbookIterator.innerJoin(FlipbookIterator.scala:360)
	at is.hail.annotations.OrderedRVIterator.innerJoin(OrderedRVIterator.scala:116)
	at is.hail.rvd.KeyedRVD.$anonfun$orderedJoin$1(KeyedRVD.scala:66)
	at is.hail.rvd.KeyedRVD.$anonfun$orderedJoin$5(KeyedRVD.scala:86)
	at is.hail.sparkextras.ContextRDD.$anonfun$czipPartitions$2(ContextRDD.scala:316)
	at is.hail.sparkextras.ContextRDD.$anonfun$cmapPartitions$3(ContextRDD.scala:218)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
	at is.hail.utils.richUtils.RichContextRDD$$anon$1.hasNext(RichContextRDD.scala:71)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:488)
	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
	at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
	at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Hail version: 0.2.72-cfce5e858cab
Error summary: FileNotFoundException: No such file or directory: s3a://my_bucket/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620

This might indicate that the file was not copied over correctly; I believe these objects were originally copied from Google Cloud Storage.

I can see that part file in Google Cloud Storage:

$ gsutil ls gs://hail-datasets-us/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620
gs://hail-datasets-us/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620

cc @danking any ideas?

Hey @CreRecombinase !

I’m really sorry you’re having trouble with the Hail datasets. It appears that, due to a not-yet-understood error, nine files failed to copy from GCS to S3. The nine files are:

part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620
part-00319-7-319-0-9b13bfeb-a79d-03e9-0625-872135ff2cd5
part-00323-7-323-0-f10b20b8-e33e-c1ae-62f0-2b09276483a5
part-00330-7-330-0-9a26e1c3-d1a4-a42a-a05c-119d81320d57
part-00331-7-331-0-dde65b4b-744e-aecf-15e9-61930f92a7eb
part-00334-7-334-0-cf54a5dc-de73-5375-a518-033129181b9d
part-06648-7-6648-0-b254866b-09f8-9a9d-6d44-ff2810d20080
part-06659-7-6659-0-a046b619-4719-cedc-9741-69cc9c2473ea
part-06667-7-6667-0-4b850f2b-be02-4213-d47a-abf1636839d9

These are all entries data files. There are no missing row, column, or global data files.

I have restored these nine files from GCS to S3. You can fix your copy of the dataset by running the following script, substituting in the name of your bucket:

YOUR_BUCKET=your-bucket

aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620 \
          s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00319-7-319-0-9b13bfeb-a79d-03e9-0625-872135ff2cd5 \
          s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00319-7-319-0-9b13bfeb-a79d-03e9-0625-872135ff2cd5
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00323-7-323-0-f10b20b8-e33e-c1ae-62f0-2b09276483a5 \
          s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00323-7-323-0-f10b20b8-e33e-c1ae-62f0-2b09276483a5
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00330-7-330-0-9a26e1c3-d1a4-a42a-a05c-119d81320d57 \
          s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00330-7-330-0-9a26e1c3-d1a4-a42a-a05c-119d81320d57
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00331-7-331-0-dde65b4b-744e-aecf-15e9-61930f92a7eb \
          s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00331-7-331-0-dde65b4b-744e-aecf-15e9-61930f92a7eb
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00334-7-334-0-cf54a5dc-de73-5375-a518-033129181b9d \
          s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-00334-7-334-0-cf54a5dc-de73-5375-a518-033129181b9d
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-06648-7-6648-0-b254866b-09f8-9a9d-6d44-ff2810d20080 \
          s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-06648-7-6648-0-b254866b-09f8-9a9d-6d44-ff2810d20080
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-06659-7-6659-0-a046b619-4719-cedc-9741-69cc9c2473ea \
          s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-06659-7-6659-0-a046b619-4719-cedc-9741-69cc9c2473ea
aws s3 cp s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-06667-7-6667-0-4b850f2b-be02-4213-d47a-abf1636839d9 \
          s3://$YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/autosomes_unphased.mt/entries/rows/parts/part-06667-7-6667-0-4b850f2b-be02-4213-d47a-abf1636839d9
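Since the nine destination keys mirror the source keys exactly, the copies above can also be generated programmatically instead of typed out. A minimal Python sketch, assuming the aws CLI is installed and configured; `YOUR_BUCKET` is a placeholder, and `build_copy_commands` is a hypothetical helper of my own:

```python
SRC = ("s3://hail-datasets-us-east-1/1000_Genomes/NYGC_30x/GRCh38/"
       "autosomes_unphased.mt/entries/rows/parts/")
DST = ("s3://YOUR_BUCKET/1000_Genomes/NYGC_30x/GRCh38/"
       "autosomes_unphased.mt/entries/rows/parts/")

# The nine part files that failed to copy:
PARTS = [
    "part-00311-7-311-0-ae371ed3-1c91-eca9-9251-acc1b0de3620",
    "part-00319-7-319-0-9b13bfeb-a79d-03e9-0625-872135ff2cd5",
    "part-00323-7-323-0-f10b20b8-e33e-c1ae-62f0-2b09276483a5",
    "part-00330-7-330-0-9a26e1c3-d1a4-a42a-a05c-119d81320d57",
    "part-00331-7-331-0-dde65b4b-744e-aecf-15e9-61930f92a7eb",
    "part-00334-7-334-0-cf54a5dc-de73-5375-a518-033129181b9d",
    "part-06648-7-6648-0-b254866b-09f8-9a9d-6d44-ff2810d20080",
    "part-06659-7-6659-0-a046b619-4719-cedc-9741-69cc9c2473ea",
    "part-06667-7-6667-0-4b850f2b-be02-4213-d47a-abf1636839d9",
]

def build_copy_commands(parts, src_prefix, dst_prefix):
    # One `aws s3 cp` argument list per missing part file.
    return [["aws", "s3", "cp", src_prefix + p, dst_prefix + p]
            for p in parts]

for cmd in build_copy_commands(PARTS, SRC, DST):
    print(" ".join(cmd))
```

To actually execute the copies rather than print them, each `cmd` could be passed to `subprocess.run(cmd, check=True)`.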

Looks like that worked, thanks for the help!