Hail command stalls or "no space left on device" with VEP annotation

Hi all,

I am trying to work with a VEP-annotated version of the Genebass variant-level summary statistics. When I do, Hail either stalls (the progress bar stops) or fails with FatalError: IOException: No space left on device, even though I am not writing any file to storage, as you can see from the code below.

Does this sound like insufficient memory/RAM, or some other issue? If it is memory, what would be the optimal parameters for setting up a VM with hailctl in this case? My current configuration is the hailctl default (hailctl dataproc start cluster_name).

Here is example code which generates the error (it runs fine if you exclude the annotate_rows command):

import hail as hl

# Load Genebass variants
genebass_variant = hl.read_matrix_table('path_to_genebass_variants')
genebass_variant = genebass_variant.key_rows_by(genebass_variant.markerID)

# Load the VEP table, then filter and annotate variants
vep_ht = hl.read_table("path_to_genebass_vep_hailtable")
vep_ht = vep_ht.key_by("markerID")
genebass_variant = genebass_variant.filter_rows(genebass_variant.annotation == "missense")
genebass_variant = genebass_variant.annotate_rows(vep = vep_ht[genebass_variant.markerID].vep)
genebass_variant = genebass_variant.filter_rows(genebass_variant.gene == "PCSK9")
genebass_variant.entries().show(10)

Thank you! -Dan

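It’s hard to know exactly what the problem is without the full stack trace, but my guess is that Spark is using HDFS on the cluster's own disks to re-order your data in the key_by, and that is where the space runs out even though you never write a file yourself. You can avoid that by explicitly initializing Hail and specifying a temporary directory in a bucket: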

hl.init(tmp_dir='gs://my-bucket/tmp')
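One thing to watch: the hl.init call should come before any other Hail call in the session, because Hail otherwise initializes itself with default settings on first use. A minimal sketch of that ordering follows. The helper names (gcs_tmp_dir, init_hail) and the bucket name are hypothetical, made up for illustration; only hl.init(tmp_dir=...) is the actual Hail call.

```python
def gcs_tmp_dir(bucket: str, prefix: str = "tmp") -> str:
    """Build a gs:// URI for Hail's temporary files (hypothetical helper)."""
    if not bucket or "/" in bucket:
        raise ValueError("bucket must be a bare bucket name, e.g. 'my-bucket'")
    return f"gs://{bucket}/{prefix.strip('/')}"


def init_hail(bucket: str):
    """Initialize Hail with a bucket-backed temporary directory.

    Call this once, before reading any table or matrix table.
    """
    # Imported lazily so gcs_tmp_dir can be used where Hail is not installed.
    import hail as hl

    hl.init(tmp_dir=gcs_tmp_dir(bucket))
    return hl
```

In a notebook this would be used as hl = init_hail("my-bucket") in the first cell, in place of a bare import followed by reads.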

Also, I recommend against using entries(). It is an inefficient representation of the entries of a matrix table. If you want to look at the entries of a matrix table, just show the matrix table itself:

genebass_variant.show()
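
Putting both suggestions together with the code from the question, a rough sketch might look like the following. The field names (markerID, annotation, gene, vep) are taken from the question; the function name and its parameters are hypothetical. Applying both filters before the annotate_rows join is an extra rearrangement that is not required by anything above; it is only meant to keep the joined data small, and I have not measured whether it changes the outcome here.

```python
def show_pcsk9_missense(mt_path, vep_path, tmp_dir):
    """Show PCSK9 missense variants with VEP annotations (illustrative sketch)."""
    # Imported lazily so this definition loads even where Hail is not installed.
    import hail as hl

    # Temporary files go to the given directory (e.g. a gs:// path).
    hl.init(tmp_dir=tmp_dir)

    # Load the Genebass variants and key the rows by markerID.
    mt = hl.read_matrix_table(mt_path)
    mt = mt.key_rows_by(mt.markerID)

    # Load the VEP table and key it the same way so it can be joined.
    vep_ht = hl.read_table(vep_path).key_by("markerID")

    # Filter first, so that fewer rows take part in the join below.
    mt = mt.filter_rows((mt.annotation == "missense") & (mt.gene == "PCSK9"))
    mt = mt.annotate_rows(vep=vep_ht[mt.markerID].vep)

    # Show the matrix table directly instead of going through entries().
    mt.show()
    return mt
```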