Too many open files error

Hi, I am suddenly getting this error, or similar ones, when I try to write out files in Hail:
Hail version: 0.2.53-96ec0aef3db7
Error summary: FileNotFoundException: /media/veracrypt10/Analyses/Analyses/scz_allentries.mt/rows/rows/parts/part-08495-19-8495-0-f6a750fd-d6e5-93c6-d136-ded5ed361c17 (Too many open files). Command run: mt.rows().select('qc', 'control_qc', 'case_qc').flatten().export('Analyses/final_qc.variants.tsv.bgz')

I also had a similar error when I tried to run: mt_rows.write('annotations/gene.ht', overwrite=True). In both instances the file that could not be found exists and I’m not sure what “too many open files” means in this context.

I would appreciate any assistance you can provide.

Wonu

Can you share the full stack trace, as well as the runtime you’re using (e.g. Spark Cluster, or local installation on a server (and how many cores on that server), etc)?

Hi,

Please see the attached file. I’m running hail locally on a computer. I run the commands below to initialize it.

PYSPARK_SUBMIT_ARGS="--driver-memory 8G --executor-memory 8G pyspark-shell" ipython
import hail as hl
hl.init(min_block_size=128)

Is this the info you mean?
stacktrace_110820.txt (35.1 KB)

That error means the process has hit the operating system’s limit on open file descriptors. It indicates that the code your pipeline is executing is not cleaning up resources properly and is leaving files open.

Do you have a sense of which version of Hail this started with?

I’d like to know more about your system. Could you run the following in a shell prompt?

uname -msrv
ulimit -a
nproc
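
For reference, the "open files" value that ulimit -a prints can also be read from inside Python, which is handy when Hail is started from ipython. The sketch below uses only the standard-library resource module (Unix only); the helper names open_file_limits and raise_soft_limit are made up for illustration and are not part of Hail. Raising the soft limit only postpones the error if file handles are really being leaked, so treat it as a diagnostic aid rather than a fix.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Raise the soft limit towards `target`, never above the hard limit.

    Hypothetical helper for illustration. Returns the limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Only ever raise the soft limit; leave it alone if it is already unlimited.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("soft/hard open-file limits:", open_file_limits())
```

Child processes inherit these limits, so a change would have to be made before hl.init() starts the JVM. Some systems may reject values the kernel does not allow.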

Hi, I’m pretty sure it was version 0.2.49. I then updated to 0.2.53 as a first attempt to resolve it.

uname -msrv
Linux 5.4.0-42-generic #46~18.04.1-Ubuntu SMP Fri Jul 10 07:21:24 UTC 2020 x86_64

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256631
max locked memory (kbytes, -l) 16384
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 256631
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

nproc
8

Thanks for the info, nothing unusual here.

When I tried toy examples to replicate this issue, I could not. What file system is /media/veracrypt10 using?

You can try running your pipeline, getting the pid of the SparkSubmit process with jps and using lsof -p <PID> to see which files it’s holding open.
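
If the lsof output is too long to read by eye, a rough alternative on Linux is to group the process’s open descriptors by the file they point to. This is only a sketch: it assumes a Linux /proc filesystem, and the function name open_file_counts is made up for illustration.

```python
import os
from collections import Counter


def open_file_counts(pid):
    """Count the open file descriptors of `pid`, grouped by what they point to.

    Linux only: reads the symlinks under /proc/<pid>/fd.
    """
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        counts[target] += 1
    return counts


if __name__ == "__main__":
    pid = os.getpid()  # replace with the SparkSubmit PID reported by jps
    counts = open_file_counts(pid)
    print(sum(counts.values()), "descriptors open")
    for target, n in counts.most_common(10):
        print(n, target)
```

A single path that shows up hundreds of times in the top entries would be a strong hint that something is opening it repeatedly without closing it.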

I’m sorry I can’t be of more assistance.

Hi again,

/media/veracrypt10 is using ext4

I also checked the open files and there are indeed a lot open, although I don’t actually know what a “normal” amount would be. I’ve attached the output from checking. Should I be trying to close them manually? If so, how would I go about this, and which ones should I (or should I not) close? They all have the same PID.

open_files_spark_140820.txt (572.1 KB)

Wow, the variant_list file appears hundreds of times. That’s not good.

What’s the full python script you’re running?

Attached!
varqc_scz_casecontrol.txt (3.8 KB)

Something to try: could you add

ht_initial_variants = ht_initial_variants.checkpoint('some_path.ht')

directly after this line:

ht_initial_variants = ht_initial_variants.key_by(ht_initial_variants.locus, ht_initial_variants.alleles)

Hi, this fixed the issue. Thank you so much! Can you explain why though? checkpoint() is essentially the same as write() right? Or was it simply a case of changing the input file since the variant_list file was seemingly the problem here?

Thanks again!

Great that this has unblocked you. I’d say that it’s worked around the issue rather than fixed it – I want to dig deeper into what Hail’s execution is doing here.

Checkpoint is identical to ht.write(path); ht = hl.read_table(path) aside from using a compression codec that is a little faster but produces slightly larger files. The reason I wanted to try this is that Hail’s execution is lazy, and when you do execute something like your final export, lots of operations that are chained together get executed all at once. Inserting a checkpoint reduces the amount of computation that’s combined together, allowing for more visibility (or sometimes fixing issues where we’re exceeding resource allotments).
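
To make the lazy-evaluation point concrete, here is a toy model in plain Python. It is not Hail’s implementation: the LazyTable class, its methods, and the dict standing in for files on disk are all invented for illustration. It only shows the general idea that steps are recorded until an action runs them all together, and that a checkpoint cuts the chain by materializing the result and reading it back.

```python
class LazyTable:
    """Toy stand-in for a lazily evaluated table (not Hail's implementation)."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable producing rows
        self._steps = tuple(steps)   # pending transformations, not yet run

    def map(self, fn):
        # Record the step; nothing is computed here.
        return LazyTable(self._source, self._steps + (fn,))

    def pending(self):
        # Number of transformations that the next action would have to run.
        return len(self._steps)

    def collect(self):
        # An action: run the source and every pending step in one go.
        rows = list(self._source())
        for fn in self._steps:
            rows = [fn(r) for r in rows]
        return rows

    def checkpoint(self, store, key):
        # Materialize now, then return a fresh table that reads the stored
        # result, so later actions no longer re-run the earlier steps.
        store[key] = self.collect()
        return LazyTable(lambda: store[key])


if __name__ == "__main__":
    store = {}  # stands in for files written to disk
    t = LazyTable(lambda: range(5)).map(lambda x: x + 1).map(lambda x: x * 2)
    print(t.pending())                     # two steps queued, nothing run yet
    t = t.checkpoint(store, "step1")
    print(t.pending())                     # chain is cut after the checkpoint
    print(t.map(lambda x: x - 1).collect())
```

In this toy, everything queued before the checkpoint runs once at the checkpoint, and later actions start from the stored result instead of the original source.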

I believe that this will fix the core issue:

er, nevermind, this was a red herring.

I hadn’t tried it yet. I will keep an eye on this thread in case a fix becomes available.

Hi again! Just wondering if there is a fix for this as I still occasionally run into the same problem (in slightly different circumstances).

Ah, sorry, I didn’t update you here – we’ve got a GitHub issue that is perhaps a better tracker:

Short answer: I think this might be fixed in 0.2.57. Since we weren’t able to replicate the problem, I’m not 100% sure, but we did fix code that could leak file handles.