Hi, I am suddenly getting this (or a similar) error when I try to write out files in Hail:
Hail version: 0.2.53-96ec0aef3db7
Error summary: FileNotFoundException: /media/veracrypt10/Analyses/Analyses/scz_allentries.mt/rows/rows/parts/part-08495-19-8495-0-f6a750fd-d6e5-93c6-d136-ded5ed361c17 (Too many open files)
Command run: mt.rows().select('qc', 'control_qc', 'case_qc').flatten().export('Analyses/final_qc.variants.tsv.bgz')
I also had a similar error when I tried to run: mt_rows.write('annotations/gene.ht', overwrite=True). In both instances the file that could not be found exists, and I'm not sure what "too many open files" means in this context.
I would appreciate any assistance you can provide.
Can you share the full stack trace, as well as the runtime you're using (e.g. a Spark cluster, or a local installation on a server, and if the latter, how many cores that server has)?
I also checked the open files and there are indeed a lot open, although I don't actually know what a "normal" amount would be. I've attached the output from checking. Should I be trying to close them manually? If so, how would I go about this, and which ones should I (or shouldn't I) close? They all have the same PID.
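For reference, here is a rough sketch of checking the count against the per-process limit on Linux (the PID below is just a placeholder for the one in the lsof output, i.e. the Hail/Spark JVM process):

import os

pid = 12345  # placeholder: the PID from the lsof output (the Hail/Spark JVM process)

# Number of file descriptors the process currently has open
open_fds = len(os.listdir(f"/proc/{pid}/fd"))

# The per-process limit; "Too many open files" means this limit was exceeded
with open(f"/proc/{pid}/limits") as f:
    limit_line = next(line for line in f if line.startswith("Max open files"))

print(f"PID {pid}: {open_fds} open file descriptors")
print(limit_line.strip())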
Hi, this fixed the issue. Thank you so much! Can you explain why, though? checkpoint() is essentially the same as write(), right? Or was it simply a case of changing the input file, since the variant_list file was seemingly the problem here?
Great that this has unblocked you. I’d say that it’s worked around the issue rather than fixed it – I want to dig deeper into what Hail’s execution is doing here.
checkpoint() is identical to ht.write(path) followed by ht = hl.read_table(path), aside from using a compression codec that is a little faster but produces slightly larger files. The reason I wanted to try this is that Hail's execution is lazy, and when you do execute something like your final export, lots of operations that are chained together get executed all at once. Inserting a checkpoint reduces the amount of computation that's combined together, allowing for more visibility (or sometimes fixing issues where we're exceeding resource allotments).
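As a concrete sketch of what I mean (the paths and the toy range table are just placeholders):

import hail as hl

ht = hl.utils.range_table(100).annotate(x=hl.rand_norm())

# An explicit write followed by reading the result back in...
ht.write('tmp/stage1.ht', overwrite=True)
ht = hl.read_table('tmp/stage1.ht')

# ...is what checkpoint() does in a single call: it forces everything up to this
# point to execute, then continues the pipeline from the on-disk result.
ht = ht.checkpoint('tmp/stage2.ht', overwrite=True)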
Ah, sorry, I didn't update you here – we've got a GitHub issue that is perhaps a better tracker:
Short answer: I think this might be fixed in 0.2.57. Since we weren't able to replicate the problem I'm not 100% sure, but we did fix code that could leak file handles.
Sorry to dig this topic up, but I have come across a very similar situation, so I'd very much appreciate your help and suggestions. Please see the scripts and error message below.
java.io.FileNotFoundException: /home/Data/my.mt/entries/rows/parts/part-0911-72-911-0-2115abb3-6ca1-f5f6-ad8d-83ae8fd91bdf (Too many open files)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:111)
at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:213)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:147)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:899)
at is.hail.io.fs.HadoopFS.openNoCompression(HadoopFS.scala:91)
at is.hail.io.fs.FS.open(FS.scala:354)
at is.hail.io.fs.FS.open$(FS.scala:353)
at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:72)
at is.hail.io.fs.FS.open(FS.scala:366)
at is.hail.io.fs.FS.open$(FS.scala:365)
at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:72)
at __C5553collect_distributed_array_matrix_native_writer.apply_region10_62(Unknown Source)
at __C5553collect_distributed_array_matrix_native_writer.apply_region9_247(Unknown Source)
at __C5553collect_distributed_array_matrix_native_writer.apply(Unknown Source)
at __C5553collect_distributed_array_matrix_native_writer.apply(Unknown Source)
at is.hail.backend.BackendUtils.$anonfun$collectDArray$4(BackendUtils.scala:48)
at is.hail.utils.package$.using(package.scala:635)
at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:162)
at is.hail.backend.BackendUtils.$anonfun$collectDArray$3(BackendUtils.scala:47)
at is.hail.backend.spark.SparkBackendComputeRDD.compute(SparkBackend.scala:799)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Hail version: 0.2.104-1940d9e8eaab
Error summary: FileNotFoundException: /home/Data/my.mt/entries/rows/parts/part-0911-72-911-0-2115abb3-6ca1-f5f6-ad8d-83ae8fd91bdf (Too many open files)
[Stage 13:========> (908 + 1) / 5789]