"No space left on device" error

We are consistently seeing this class of error when doing various operations in Hail:

java.io.IOException: No space left on device
  at java.io.FileOutputStream.writeBytes(Native Method)
  at java.io.FileOutputStream.write(FileOutputStream.java:326)
  at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
  at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
  at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:220)
  at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:173)
  at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:252)
  at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:211)
  at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:419)
  at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
  at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
  at org.apache.spark.scheduler.Task.run(Task.scala:121)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)

In this case, something to the effect of the following was running on GCP (via hailctl), tested with 0.2.19 and 0.2.20 (testing with 0.2.18 is currently in progress):

mt = hail.import_vcf(vcfs)

if mt.n_partitions() > PARTITIONS:
    mt = mt.repartition(PARTITIONS)

mt.write(f"gs://my-bucket/foo.mt", overwrite=True)

On GCP, the logs then indicate that the executor slaves are systematically lost. The job doesn’t fail at this point; it just sits there doing nothing:

java.io.IOException: Failed to send RPC RPC 5026120229335671966 to /10.154.0.146:50726: java.nio.channels.ClosedChannelException
  at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:357)
  at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:334)
  at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
  at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
  at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
  at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
  at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
  at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
  at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
  at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
  at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
  at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
  at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
  at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
  at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
  at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
  at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
  at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
  at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

We have not experienced the loss of slaves on our own internal clusters, but we are consistently seeing the “No space left on device” error with Hail 0.2.20 here. It appears to be related to temporary space, but our attempts to provision more space have not made a difference (we’ve tried up to 0.5TiB of /tmp).

This means you don’t have enough local disk. I don’t know where Spark stores its shuffle intermediates, but it might not be /tmp. The real answer is: don’t use repartition; it is slow, uses a Spark shuffle, and is thus prone to failure. Use the min_partitions argument to import_vcf instead.

If you’re using preemptible workers, then the executors are lost because Google is preempting them and eventually giving you a replacement. Shuffle operations (e.g. repartition) will fail in this environment.

I see you’re trying to reduce partition count, try naive_coalesce instead.

You can also change min_block_size on hl.init to force your blocks to be larger.
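
Putting those suggestions together, a sketch might look like the following (not tested against your data; vcfs, PARTITIONS and the output path are the placeholders from your snippet, and the min_block_size value is purely illustrative):

import hail as hl

# min_block_size is in megabytes; a larger value gives fewer, bigger
# partitions at import time (128 here is purely illustrative).
hl.init(min_block_size=128)

# Ask import_vcf for the partitioning you want up front rather than
# repartitioning (and therefore shuffling) afterwards.
mt = hl.import_vcf(vcfs, min_partitions=PARTITIONS)

# If there are still too many partitions, naive_coalesce merges adjacent
# partitions without a shuffle.
if mt.n_partitions() > PARTITIONS:
    mt = mt.naive_coalesce(PARTITIONS)

mt.write("gs://my-bucket/foo.mt", overwrite=True)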

Thanks

The Spark scratch directory is set by the spark.local.dir configuration property or the SPARK_LOCAL_DIRS environment variable. I assume that’s where the shuffle files would end up.
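
For what it’s worth, here is a sketch of setting that property from Python before Hail starts Spark. The /mnt/scratch path is hypothetical, and on YARN-based clusters (which is what Dataproc uses) the node manager’s local directories generally take precedence over spark.local.dir, so it may need to be configured at the cluster level instead:

from pyspark import SparkConf, SparkContext
import hail as hl

# Point Spark's scratch space (shuffle spills, block-manager files) at a
# bigger volume; /mnt/scratch is a hypothetical mount point.
conf = SparkConf().set("spark.local.dir", "/mnt/scratch")
sc = SparkContext(conf=conf)

# Hand the pre-configured SparkContext to Hail.
hl.init(sc=sc)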

0.2.18 has failed for the same reason. We shall try:

mt = hail.import_vcf(vcfs, min_partitions=PARTITIONS)

if mt.n_partitions() > PARTITIONS:
    mt = mt.naive_coalesce(PARTITIONS)

mt.write(f"gs://my-bucket/foo.mt", overwrite=True)

…on GCP. How does one control the preemptible workers? The hailctl defaults suggest that this isn’t the norm:

--num-preemptible-workers NUM_PREEMPTIBLE_WORKERS, --n-pre-workers NUM_PREEMPTIBLE_WORKERS, -p NUM_PREEMPTIBLE_WORKERS
                      Number of preemptible worker machines (default: 0).

hailctl defaults to 0 preemptible workers because we have no way of guessing how many workers you would want. What do you mean by “control” the preemptible workers?

I’m not setting --num-preemptible-workers, so I assume that means my workers are not preemptible. (Forgive my Google Cloud naivete.) If that’s the case, what is the implication of the loss of slaves?

How big a cluster are you using, and how big is your dataset?

A shuffle needs to store the entire dataset on local hard drives on the cluster workers. If you have a tiny cluster and a big dataset, you could be seeing this error due to that.

You can specify the number of non-preemptible workers and the number of preemptible workers separately; the total number of workers is the sum of the two. An executor typically has four cores, so that worker total gets spread over some number of four-core executors (roughly the cluster’s total worker cores divided by four).

A Dataproc cluster must have at least two non-preemptible workers. Preemptible workers are 1/5 the cost of non-preemptible, so folks usually use a handful of non-preemptibles (often the default of 2, rarely more than 10) and hundreds of preemptibles.

NB: preemptible workers can be preempted, which causes shuffles to fail. If you have to shuffle, you should generally use non-preemptible workers only.

We’re using 64 non-preemptible workers:

hailctl dataproc start --zone="europe-west2-a" --num-workers=64 # etc., etc.

The VCFs we’re trying to combine into a Matrix Table weigh in, in this case, at 2.2TiB, distributed over ~1,700 files (not at all uniformly). So I suspect, as @tpoterba suggests, it doesn’t fit over 64x n1-standard-8s.


EDIT: The default --worker-boot-disk-size is 40GiB; 64 × 40GiB is 2.5TiB, so maybe it would fit (just about).
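
As a rough sanity check on that arithmetic (figures taken from this thread; the assumption that a shuffle needs roughly the whole dataset in worker scratch space comes from the earlier reply):

# Back-of-envelope shuffle-capacity estimate using the numbers above.
dataset_tib = 2.2                    # combined size of the input VCFs
workers = 64
boot_disk_gib = 40                   # default --worker-boot-disk-size
scratch_tib = workers * boot_disk_gib / 1024

# 64 * 40 GiB = 2560 GiB = 2.5 TiB, but the boot disk also holds the OS,
# logs and (by default) HDFS data, so usable scratch is somewhat less;
# a 2.2 TiB shuffle is a very tight fit at best.
print(f"{scratch_tib:.1f} TiB of raw disk vs {dataset_tib} TiB of input")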

The serialized representation in a Spark shuffle isn’t exactly the same as a Matrix Table, so I wouldn’t be surprised if it’s a bit bigger.

I think we’d never really expect a 2.2T shuffle to go well, though – switching to naive_coalesce should solve the problem.

It did indeed. Thank you, all 🙂