"No space left on device" error

We are consistently seeing this class of error when doing various operations in Hail:

java.io.IOException: No space left on device
  at java.io.FileOutputStream.writeBytes(Native Method)
  at java.io.FileOutputStream.write(FileOutputStream.java:326)
  at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
  at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
  at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:220)
  at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:173)
  at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:252)
  at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:211)
  at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:419)
  at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
  at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
  at org.apache.spark.scheduler.Task.run(Task.scala:121)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)

In this case, something to the effect of the following was running on GCP (via hailctl), tested with 0.2.19 and 0.2.20 (testing with 0.2.18 is currently in progress):

mt = hail.import_vcf(vcfs)

if mt.n_partitions() > PARTITIONS:
    mt = mt.repartition(PARTITIONS)

mt.write(f"gs://my-bucket/foo.mt", overwrite=True)

On GCP, the logs then indicate that the executor slaves are systematically lost. The job doesn’t fail at this point; it just sits there doing nothing:

java.io.IOException: Failed to send RPC RPC 5026120229335671966 to /10.154.0.146:50726: java.nio.channels.ClosedChannelException
  at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:357)
  at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:334)
  at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
  at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
  at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
  at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
  at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
  at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
  at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
  at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
  at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
  at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
  at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
  at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
  at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
  at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
  at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
  at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
  at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

We have not experienced the loss of slaves on our own internal clusters, but we are consistently seeing the “No space left on device” error with Hail 0.2.20 here. It appears to be related to temporary space, but our attempts to provision more space have not made a difference (we’ve tried up to 0.5TiB of /tmp).

This means you don’t have enough local disk. I don’t know where Spark stores its shuffle intermediates, but it might not be /tmp. The real answer is: don’t use repartition; it is slow, uses a Spark shuffle, and is thus prone to failure. Use the min_partitions argument to import_vcf instead.

If you’re using preemptible workers, then the executors are lost because Google is preempting them and eventually giving you a replacement. Shuffle operations (e.g. repartition) will fail in this environment.

I see you’re trying to reduce partition count, try naive_coalesce instead.

You can also change min_block_size on hl.init to force your blocks to be larger.
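
Putting those suggestions together, a sketch might look like the following (not tested against your data; vcfs, PARTITIONS and the output path are the placeholders from your snippet, and the min_block_size value is purely illustrative):

import hail as hl

# min_block_size is in megabytes; a larger value gives fewer, bigger
# partitions at import time (128 here is purely illustrative).
hl.init(min_block_size=128)

# Ask import_vcf for the partitioning you want up front rather than
# repartitioning (and therefore shuffling) afterwards.
mt = hl.import_vcf(vcfs, min_partitions=PARTITIONS)

# If there are still too many partitions, naive_coalesce merges adjacent
# partitions without a shuffle.
if mt.n_partitions() > PARTITIONS:
    mt = mt.naive_coalesce(PARTITIONS)

mt.write("gs://my-bucket/foo.mt", overwrite=True)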

Thanks

The Spark scratch directory is set by the spark.local.dir configuration property or the SPARK_LOCAL_DIRS environment variable. I assume that’s where the shuffle files would end up.
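
For what it’s worth, here is a sketch of setting that property from Python before Hail starts Spark. The /mnt/scratch path is hypothetical, and on YARN-based clusters (which is what Dataproc uses) the node manager’s local directories generally take precedence over spark.local.dir, so it may need to be configured at the cluster level instead:

from pyspark import SparkConf, SparkContext
import hail as hl

# Point Spark's scratch space (shuffle spills, block-manager files) at a
# bigger volume; /mnt/scratch is a hypothetical mount point.
conf = SparkConf().set("spark.local.dir", "/mnt/scratch")
sc = SparkContext(conf=conf)

# Hand the pre-configured SparkContext to Hail.
hl.init(sc=sc)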

0.2.18 has failed for the same reason. We shall try:

mt = hail.import_vcf(vcfs, min_partitions=PARTITIONS)

if mt.n_partitions() > PARTITIONS:
    mt = mt.naive_coalesce(PARTITIONS)

mt.write(f"gs://my-bucket/foo.mt", overwrite=True)

…on GCP. How does one control the preemptible workers? The hailctl defaults suggest that this isn’t the norm:

--num-preemptible-workers NUM_PREEMPTIBLE_WORKERS, --n-pre-workers NUM_PREEMPTIBLE_WORKERS, -p NUM_PREEMPTIBLE_WORKERS
                      Number of preemptible worker machines (default: 0).

hailctl defaults to 0 preemptible workers because we have no way of guessing how many workers you would want. What do you mean by “control” the preemptible workers?

I’m not setting --num-preemptible-workers, so I assume that means my workers are not preemptible. (Forgive my Google Cloud naivete.) If that’s the case, what is the implication of the loss of slaves?

How big a cluster are you using, and how big is your dataset?

A shuffle needs to store the entire dataset on local hard drives on the cluster workers. If you have a tiny cluster and a big dataset, you could be seeing this error due to that.

You can specify the number of non-preemptible workers and the number of preemptible workers separately; the total number of workers is the sum of the two. An executor typically has four cores, so that worker total gets spread over some number of four-core executors (roughly the cluster’s total worker cores divided by four).

A Dataproc cluster must have at least two non-preemptible workers. Preemptible workers are 1/5 the cost of non-preemptible, so folks usually use a handful of non-preemptibles (often the default of 2, rarely more than 10) and hundreds of preemptibles.

NB: preemptible workers can be preempted, which causes shuffles to fail. If you have to shuffle, you should generally use non-preemptible workers only.

We’re using 64 non-preemptible workers:

hailctl dataproc start --zone="europe-west2-a" --num-workers=64 # etc., etc.

The VCFs we’re trying to combine into a Matrix Table weigh in, in this case, at 2.2TiB, distributed over ~1,700 files (not at all uniformly). So I suspect, as @tpoterba suggests, it doesn’t fit over 64x n1-standard-8s.


EDIT: The default --worker-boot-disk-size is 40GiB; 64 × 40GiB is 2.5TiB, so maybe it would fit (just about).
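
As a rough sanity check on that arithmetic (figures taken from this thread; the assumption that a shuffle needs roughly the whole dataset in worker scratch space comes from the earlier reply):

# Back-of-envelope shuffle-capacity estimate using the numbers above.
dataset_tib = 2.2                    # combined size of the input VCFs
workers = 64
boot_disk_gib = 40                   # default --worker-boot-disk-size
scratch_tib = workers * boot_disk_gib / 1024

# 64 * 40 GiB = 2560 GiB = 2.5 TiB, but the boot disk also holds the OS,
# logs and (by default) HDFS data, so usable scratch is somewhat less;
# a 2.2 TiB shuffle is a very tight fit at best.
print(f"{scratch_tib:.1f} TiB of raw disk vs {dataset_tib} TiB of input")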

The serialized representation in a Spark shuffle isn’t exactly the same as a Matrix Table, so I wouldn’t be surprised if it’s a bit bigger.

I think we’d never really expect a 2.2T shuffle to go well, though – switching to naive_coalesce should solve the problem.

It did indeed. Thank you, all 🙂