MatrixTable filtering and LD pruning: error message "No space left on device"

Hi,

I was trying to filter out variants and then LD-prune them, but got the error message: IOException: No space left on device.

The MatrixTable includes ~42 million variants and ~13K samples. I'm running it on a local HPC cluster.

I suspect I may not be using the most efficient way to do this. Any suggestions/comments would be appreciated. Thank you!

Below are the details of the script.

import hail as hl
hl.init(spark_conf={'spark.driver.memory': '15g', 'spark.executor.memory': '15g'})

# Read the MatrixTable with a higher partition count.
mt = hl.read_matrix_table(mt_path, _n_partitions=6000)

# Keep entries where HFT == 1, or where HFT == 8/16 and the maximum
# genotype probability exceeds 0.95.
mt_filt = mt.filter_entries((mt.HFT == 1) | (((mt.HFT == 8) | (mt.HFT == 16)) & (hl.max(mt.GP) > 0.95)))

# Keep biallelic autosomal SNPs with call rate > 99% and alternate allele frequency > 1%.
mt_filt = mt_filt.filter_rows((hl.len(mt_filt.alleles) == 2) &
                     hl.is_snp(mt_filt.alleles[0], mt_filt.alleles[1]) &
                     (hl.agg.fraction(hl.is_defined(mt_filt.GT)) > 0.99) &
                     (hl.agg.mean(mt_filt.GT.n_alt_alleles()) / 2 > 0.01) &
                     (mt_filt.locus.contig != "chrX") & (mt_filt.locus.contig != "chrY") &
                     (mt_filt.locus.contig != "chrM"))

# LD-prune the remaining variants (returns a table of variants to keep).
pruned_variant_table = hl.ld_prune(mt_filt.GT, r2=0.1)

# Keep only the pruned variants and write the result as a new MatrixTable.
pruned_mt = mt_filt.filter_rows(hl.is_defined(pruned_variant_table[mt_filt.row_key]), keep=True)

pruned_mt.write('/filepath/ld_pruned_comm_bialle_highqual_variants.mt', overwrite=True)

Hey @bluesky!

Can you share the hail log file from this run?

This message means you’re running out of space. It’s possible you don’t have enough quota or disk space at /filepath.

In general, duplicating all the genotypes (i.e. writing an entire MT with the GTs) will use quite a bit of space!

If the prune only keeps a small percentage of variants (~10% or fewer), I expect a performance benefit for operations that read that MT (because you're reading only 10% of the data). I also expect that writing it will take only about 10% as much space as the original MT.

On the other hand, if your prune keeps a large percentage of the rows, copying the entire dataset isn't worth the space and time cost. Instead, I recommend saving the LD-pruned variants as a table:

pruned_variant_table.write('...')

And using it as a filter in the future:

pruned_variant_table = hl.read_table(...)
mt = mt.semi_join_rows(pruned_variant_table)
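
Putting those two snippets together, a minimal end-to-end sketch could look like the following (the table path is a hypothetical placeholder, not from the original post):

# Write the small pruned-variant table once (hypothetical path).
pruned_variant_table.write('/filepath/pruned_variants.ht', overwrite=True)

# Later, read the table back and keep only those rows in the MatrixTable,
# without duplicating the genotype data on disk.
pruned_variant_table = hl.read_table('/filepath/pruned_variants.ht')
mt_pruned = mt_filt.semi_join_rows(pruned_variant_table)

This way only the compact variant table is duplicated on disk, while the genotypes stay in the original MatrixTable.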

Hi @danking

Thank you so much for your reply and suggestion!

You’re right - the issue is that the written file used up all the space. Additionally, I later noticed that the default temporary file folder on my HPC is full (it only has 10G), so maybe that is why I got the error message (i.e., no space left on device)?

Is there a way to change the default tmp folder in Hail? If I could point it to my own folder on the HPC (which is much larger than 10G), that might resolve the issue.

Please let me know if and how I may change the path for the tmpDir.

Also, I appreciate your suggestion of saving the LD-pruned variants as a table. After pruning, I still have ~18% of the variants left, so I'll try saving them as a table.

Again, thank you very much!

For the temporary directory: hl.init(tmp_dir=..., local_tmpdir=...)
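
For example, a sketch pointing both temporary directories at a larger scratch area you own (the paths below are placeholders, not from the original post):

import hail as hl

# Direct Hail's temporary files to a location with enough free space.
# tmp_dir is the network-visible temporary directory; local_tmpdir is
# the node-local one. Both paths here are placeholders.
hl.init(
    spark_conf={'spark.driver.memory': '15g', 'spark.executor.memory': '15g'},
    tmp_dir='/path/to/your/scratch/hail_tmp',
    local_tmpdir='/path/to/your/scratch/hail_local_tmp',
)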

This worked!

Thank you very much for your help!
