Filter mt rows by missingness


I’m trying to filter the rows of a matrix table (mt) down to those with a missing GP, but I get the error message “FileNotFoundException: too many open files”.

Short of increasing the open-file limit, is there another way to extract the rows (variants) with missing GP?

Any suggestions? Thank you very much!

Below is the script:

import hail as hl

# `path` is the temporary directory for this job
hl.init(spark_conf={'spark.driver.memory': '20g', 'spark.executor.memory': '40g'}, tmp_dir=path, default_reference='GRCh38')

mt = hl.read_matrix_table('/path/mt', _n_partitions=6000)
mt_impt = mt.filter_rows(hl.agg.any(hl.is_missing(mt.GP)))  # error message shows right after this

The number of open files should scale roughly with the number of cores in use. You can set that with:

hl.init(master='local[N]')  # N is number of cores to use

I’m rather surprised that you’re hitting this file limit though. Usually the file limit is a lot higher than the number of cores in use.

Can you share more information about your environment? Where are you executing this code?
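
One way to gather that information from Python is sketched below. This is a minimal example that assumes a Unix-like system (the standard-library `resource` module is not available on Windows); the function name `describe_limits` is hypothetical. Note that `os.cpu_count()` reports the cores on the machine, which may be more than an HPC job is actually allowed to use.

```python
import os
import resource  # standard library, Unix-only


def describe_limits():
    """Report the machine's core count and this process's open-file limits."""
    # RLIMIT_NOFILE is the per-process limit on open file descriptors.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "cpu_count": os.cpu_count(),  # cores on the machine, not the job's allotment
        "nofile_soft": soft,          # the limit currently enforced
        "nofile_hard": hard,          # ceiling the soft limit can be raised to
    }
```

Calling `print(describe_limits())` in the same session that runs the Hail script shows the limits that script is actually subject to.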

Thank you for your reply!

I’m using a local HPC cluster with 8 cores and 241 GB of memory in total. The open-file limit is the default, 1024.

Before I saw your suggestion, I tried reducing the number of partitions (from 6000 to 100), and the “too many open files” error went away!

If I follow your suggestion, should I still set the number of cores to 8?

Thank you!

You should set the number of cores to however many cores your HPC job is permitted to use. It sounds like you should set it to 8.
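
Putting the two suggestions from this thread together might look like the sketch below. It is only an illustration: the helper names `local_master` and `rows_with_missing_gp` are hypothetical, the defaults of 8 cores and 100 partitions are simply the values mentioned above, and the memory settings and `_n_partitions` argument are copied from the original script. Hail is imported inside the function so that the small helper can be used on its own.

```python
def local_master(n_cores):
    """Build a Spark local-mode master string such as 'local[8]'."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return f"local[{n_cores}]"


def rows_with_missing_gp(mt_path, tmp_dir, n_cores=8, n_partitions=100):
    """Keep the rows (variants) in which at least one entry has a missing GP."""
    import hail as hl  # imported lazily; requires Hail to be installed

    hl.init(
        master=local_master(n_cores),  # cap the number of cores in use
        spark_conf={'spark.driver.memory': '20g', 'spark.executor.memory': '40g'},
        tmp_dir=tmp_dir,
        default_reference='GRCh38',
    )
    # Fewer partitions than the original 6000, as tried earlier in the thread.
    mt = hl.read_matrix_table(mt_path, _n_partitions=n_partitions)
    # hl.agg.any aggregates over the entries of each row.
    return mt.filter_rows(hl.agg.any(hl.is_missing(mt.GP)))
```

The result is a lazy matrix table, so the filter is only evaluated when a later step (for example a write or count) runs.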