Too many open files error

wonu · August 11, 2020, 1:14pm

Hi, I am suddenly getting this or similar errors when I try to write out files in hail:
Hail version: 0.2.53-96ec0aef3db7
Error summary: FileNotFoundException: /media/veracrypt10/Analyses/Analyses/scz_allentries.mt/rows/rows/parts/part-08495-19-8495-0-f6a750fd-d6e5-93c6-d136-ded5ed361c17 (Too many open files)
Command run: mt.rows().select('qc', 'control_qc', 'case_qc').flatten().export('Analyses/final_qc.variants.tsv.bgz')

I also had a similar error when I tried to run: mt_rows.write('annotations/gene.ht', overwrite=True). In both instances the file that could not be found exists, and I’m not sure what “too many open files” means in this context.

I would appreciate any assistance you can provide.

Wonu

tpoterba · August 11, 2020, 1:17pm

Can you share the full stack trace, as well as the runtime you’re using (e.g. Spark Cluster, or local installation on a server (and how many cores on that server), etc)?

wonu · August 12, 2020, 8:06pm

Hi,

Please see the attached file. I’m running hail locally on a computer. I run the commands below to initialize it.

PYSPARK_SUBMIT_ARGS="--driver-memory 8G --executor-memory 8G pyspark-shell" ipython
import hail as hl
hl.init(min_block_size=128)

Is this the info you mean?
stacktrace_110820.txt (35.1 KB)

chrisvittal · August 12, 2020, 8:42pm

That error indicates that the code your pipeline is executing is not cleaning up resources properly, and leaving open files around.

Do you have a sense of what version of hail this started being an issue?

chrisvittal · August 12, 2020, 8:56pm

I’d like to know more about your system. Could you run the following in a shell prompt?

uname -msrv
ulimit -a
nproc

wonu · August 13, 2020, 7:42am

Hi, I’m pretty sure it started with version 0.2.49. I then updated to 0.2.53 as a first attempt at resolving it.

uname -msrv
Linux 5.4.0-42-generic #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020 x86_64

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256631
max locked memory (kbytes, -l) 16384
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 256631
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

nproc
8

chrisvittal · August 13, 2020, 5:25pm

Thanks for the info, nothing unusual here.

When I tried toy examples to replicate this issue, I could not. What file system is /media/veracrypt10 using?

You can try running your pipeline, getting the pid of the SparkSubmit process with jps and using lsof -p <PID> to see which files it’s holding open.
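
For anyone who would rather summarize the open files than read raw lsof output, here is a minimal sketch in Python. It assumes a Linux system where /proc/<PID>/fd is readable for the process in question; the function name open_file_counts is made up for illustration and is not part of Hail or Spark. It counts how many descriptors point at each path, so a file that is held open hundreds of times stands out.

```python
import os
from collections import Counter


def open_file_counts(pid):
    """Count open descriptors per target path for a process (Linux /proc only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for fd in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to the file, socket, or pipe it refers to.
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        counts[target] += 1
    return counts


if __name__ == "__main__":
    # Inspects this Python process; substitute the SparkSubmit PID from jps.
    counts = open_file_counts(os.getpid())
    print(f"{sum(counts.values())} descriptors open")
    for path, n in counts.most_common(5):
        print(n, path)
```

Comparing the total against the "open files" value from ulimit -a (1024 in the output above) shows how close the process is to the limit.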

I’m sorry I can’t be of more assistance.

wonu · August 14, 2020, 8:39am

Hi again,

/media/veracrypt10 is using ext4

I also checked the open files and there are indeed a lot open, although I don’t actually know what a “normal” amount would be. I’ve attached the output from checking. Should I be trying to close them manually? If so, how would I go about this, and which ones should I (or should I not) close? They all have the same PID.

open_files_spark_140820.txt (572.1 KB)

tpoterba · August 14, 2020, 11:30am

Wow, the variant_list file appears hundreds of times. That’s not good.

What’s the full python script you’re running?

wonu · August 14, 2020, 12:06pm

Attached!
varqc_scz_casecontrol.txt (3.8 KB)

tpoterba · August 14, 2020, 1:15pm

something to try – could you add

ht_initial_variants = ht_initial_variants.checkpoint('some_path.ht')

directly after this line:

ht_initial_variants = ht_initial_variants.key_by(ht_initial_variants.locus, ht_initial_variants.alleles)

wonu · August 18, 2020, 7:47am

Hi, this fixed the issue. Thank you so much! Can you explain why though? checkpoint() is essentially the same as write() right? Or was it simply a case of changing the input file since the variant_list file was seemingly the problem here?

Thanks again!

tpoterba · August 18, 2020, 9:23am

Great that this has unblocked you. I’d say that it’s worked around the issue rather than fixed it – I want to dig deeper into what Hail’s execution is doing here.

Checkpoint is identical to ht.write(path); ht = hl.read_table(path) aside from using a compression codec that is a little faster but produces slightly larger files. The reason I wanted to try this is that Hail’s execution is lazy, and when you do execute something like your final export, lots of operations that are chained together get executed all at once. Inserting a checkpoint reduces the amount of computation that’s combined together, allowing for more visibility (or sometimes fixing issues where we’re exceeding resource allotments).
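
To make the lazy-execution point concrete, here is a toy sketch in plain Python. It is an illustration only and not how Hail is implemented: LazyTable, pending_ops, and the in-memory store dict are all hypothetical stand-ins (a real checkpoint writes to disk and reads the table back). The idea it shows is that transforms are only recorded until something forces execution, and that a checkpoint runs the recorded chain once and restarts from the saved result with an empty chain.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated table; NOT Hail's implementation."""

    def __init__(self, source, ops=()):
        self._source = source   # callable that returns a list of rows
        self._ops = tuple(ops)  # transforms recorded but not yet run

    def map(self, fn):
        # Record the transform without executing it.
        return LazyTable(self._source, self._ops + (fn,))

    def pending_ops(self):
        return len(self._ops)

    def collect(self):
        # Only here does the whole recorded chain actually run.
        rows = self._source()
        for fn in self._ops:
            rows = [fn(r) for r in rows]
        return rows

    def checkpoint(self, store, key):
        # Run everything now, save the result, and restart from the saved copy.
        store[key] = self.collect()
        return LazyTable(lambda: list(store[key]))


store = {}  # hypothetical stand-in for files on disk
t = LazyTable(lambda: [1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 2)
chain_before = t.pending_ops()  # two transforms queued, nothing executed yet
t = t.checkpoint(store, "step1")
chain_after = t.pending_ops()   # chain is empty after the checkpoint
result = t.map(lambda x: x - 1).collect()
print(chain_before, chain_after, result)  # 2 0 [3, 5, 7]
```

In the toy, the final collect() only has to run the one transform added after the checkpoint, which mirrors how a checkpoint reduces the amount of work combined into the final export.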

tpoterba · August 18, 2020, 3:04pm

I believe that this will fix the core issue:

tpoterba · August 18, 2020, 5:38pm

er, nevermind, this was a red herring.

wonu · August 20, 2020, 11:55am

I hadn’t tried it yet. I will keep an eye on this thread in case a fix becomes available.

wonu · September 10, 2020, 3:41pm

Hi again! Just wondering if there is a fix for this as I still occasionally run into the same problem (in slightly different circumstances).

tpoterba · September 10, 2020, 3:57pm

Ah, sorry, I didn’t update you here – we’ve got a github issue that is perhaps a better tracker:

Short answer: I think this might be fixed in 0.2.57. We weren’t able to replicate the problem, so I’m not 100% sure, but we did fix code that could leak file handles.

bluesky · December 15, 2022, 7:40pm

Hi,

Sorry to dig this topic up, but I have come across a very similar situation, so I’d very much like your help and suggestions. Please see the script and error message below.

Thanks much in advance!

The Hail version I’m using is 0.2.104.

Below is the script:

import hail as hl 
hl.init(spark_conf={'spark.driver.memory': '20g','spark.executor.memory': '40g'}, tmp_dir = filepath)

mt = hl.read_matrix_table('/filepath/my.mt', _n_partitions =6000)
mt_filt = mt.filter_entries((mt.HFT ==1) | (((mt.HFT ==8) | (mt.HFT ==16)) & (hl.max(mt.GP) > 0.95)))

mt_filt = mt_filt.filter_rows((hl.len(mt_filt.alleles) == 2) & hl.is_snp(mt_filt.alleles[0], mt_filt.alleles[1]) & 
                     (hl.agg.fraction(hl.is_defined(mt_filt.GT)) > 0.99) & 
                     (hl.agg.mean(mt_filt.GT.n_alt_alleles())/2 > 0.01) & 
                     (mt_filt.locus.contig != "chrX") & (mt_filt.locus.contig != "chrY") & (mt_filt.locus.contig != "chrM"))

#LD pruning    
pruned_variant_table = hl.ld_prune(mt_filt.GT, r2 = 0.1)

#keep LD pruned variants 
pruned_mt = mt_filt.filter_rows(hl.is_defined(pruned_variant_table[mt_filt.row_key]), keep = True)

#save the pruned variants as a table 
pruned_mt.write('/filepath/ld_pruned_comm_bialle_highqual_variants.ht')

Here is the log file and error message:

java.io.FileNotFoundException: /home/Data/my.mt/entries/rows/parts/part-0911-72-911-0-2115abb3-6ca1-f5f6-ad8d-83ae8fd91bdf (Too many open files)
	at java.io.FileInputStream.open0(Native Method)
	at java.io.FileInputStream.open(FileInputStream.java:195)
	at java.io.FileInputStream.<init>(FileInputStream.java:138)
	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:111)
	at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:213)
	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:147)
	at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:899)
	at is.hail.io.fs.HadoopFS.openNoCompression(HadoopFS.scala:91)
	at is.hail.io.fs.FS.open(FS.scala:354)
	at is.hail.io.fs.FS.open$(FS.scala:353)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:72)
	at is.hail.io.fs.FS.open(FS.scala:366)
	at is.hail.io.fs.FS.open$(FS.scala:365)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:72)
	at __C5553collect_distributed_array_matrix_native_writer.apply_region10_62(Unknown Source)
	at __C5553collect_distributed_array_matrix_native_writer.apply_region9_247(Unknown Source)
	at __C5553collect_distributed_array_matrix_native_writer.apply(Unknown Source)
	at __C5553collect_distributed_array_matrix_native_writer.apply(Unknown Source)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$4(BackendUtils.scala:48)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:162)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$3(BackendUtils.scala:47)
	at is.hail.backend.spark.SparkBackendComputeRDD.compute(SparkBackend.scala:799)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)




Hail version: 0.2.104-1940d9e8eaab
Error summary: FileNotFoundException: /home/Data/my.mt/entries/rows/parts/part-0911-72-911-0-2115abb3-6ca1-f5f6-ad8d-83ae8fd91bdf (Too many open files)

[Stage 13:========>                                            (908 + 1) / 5789]

danking · December 16, 2022, 3:42pm

Hey @bluesky,

It’s possible this is a Hail bug and we’ll look into that. For now, I recommend increasing the open file limit.
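
One way to raise the limit from inside the Python session is sketched below, using the standard-library resource module. The assumptions: you are on Linux or another Unix, Hail is running locally so the JVM is started as a child of this Python process, and the snippet runs before hl.init() so the child inherits the new limit. The helper name raise_open_file_limit and the target of 4096 are illustrative choices, not Hail recommendations. The soft limit can only be raised as far as the hard limit without root.

```python
import resource


def raise_open_file_limit(target=4096):
    """Raise the soft open-file limit toward target; return (old_soft, new_soft)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot exceed the hard limit.
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        # Already high enough; never lower an existing limit.
        return soft, soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return soft, target


if __name__ == "__main__":
    old, new = raise_open_file_limit()
    print(f"open-file soft limit: {old} -> {new}")
```

The shell equivalent is running ulimit -n with a larger value in the same shell before starting Python. Either way, this only buys headroom; if file handles are being leaked, a large enough job can still hit the new limit.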
