I’ve had a couple of issues when performing a lot of aggregations: the aggregate_intermediates directory in the location specified by tmp_dir when Hail is initialised fills up, and I get the following error:
2022-07-26 17:28:14 TaskSetManager: WARN: Lost task 874.0 in stage 336.0 (TID 529110) (192.168.252.56 executor 40): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): The directory item limit of /aggregate_intermediates is exceeded: limit=1048576 items=1048576
If I go into my file system and manually clear this directory, I can run aggregations again.
Is there a way to ensure that this subdirectory of tmp_dir is cleared at intervals?
I have experienced the same issue from time to time, and I manually delete the directory to fix it. Is there a Spark option to increase the item limit on this directory? Also, is it possible to use a different path for each user, so that deleting aggregate_intermediates for one user does not affect other users’ jobs?
This is really a Hail issue – we need to be eagerly cleaning up files when they’re no longer necessary.
@AB.Hail - there’s no Spark option here, since it’s a Hail parameter. You can set the Hail temp dir on init with:

hl.init(...other args..., tmp_dir='...')

You can set this to a blob store path (a Google or S3 bucket, etc.) and that will work fine if you’re running on the cloud.
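To also address the per-user question above, one option is to build a tmp_dir path that includes the current username, so each user’s aggregate_intermediates lives under its own prefix. A minimal sketch (the bucket name `my-bucket` is a placeholder, not a real path; the `hl.init` call is shown commented out since it needs a running Hail installation):

```python
import getpass

# Per-user temporary directory; "my-bucket" is a hypothetical bucket name.
user = getpass.getuser()
tmp_dir = f'gs://my-bucket/hail-tmp/{user}'

# Pass it to Hail at initialisation:
# import hail as hl
# hl.init(tmp_dir=tmp_dir)
print(tmp_dir)
```

With this scheme, clearing one user’s directory leaves other users’ intermediates untouched.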
Thank you @AB.Hail and @tpoterba
I resorted to manually removing the contents of aggregate_intermediates, as I was running sample_qc so many times in one script that I was running out of space in my tmp_dir.
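Until Hail cleans these files up eagerly, that manual cleanup can be scripted between pipeline stages. A minimal sketch for a local-filesystem tmp_dir (for an HDFS path you would instead shell out to something like `hdfs dfs -rm -r`; the directory names below are illustrative only):

```python
import os
import shutil
import tempfile

def clear_dir_contents(path):
    """Remove everything inside `path` but keep the directory itself."""
    for entry in os.listdir(path):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            shutil.rmtree(full)
        else:
            os.remove(full)

# Demonstration on a throwaway directory standing in for
# <tmp_dir>/aggregate_intermediates:
demo = tempfile.mkdtemp()
inter = os.path.join(demo, 'aggregate_intermediates')
os.makedirs(inter)
open(os.path.join(inter, 'part-0'), 'w').close()

clear_dir_contents(inter)
print(os.listdir(inter))  # -> []
shutil.rmtree(demo)
```

Calling this between repeated sample_qc runs keeps the intermediates directory from hitting the HDFS item limit.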