Hi,
Our team is currently working with the UKBB 200K and 450K WES releases. We have created lots of Hail matrix tables (initial, preprocessing, and analysis stages) and stored them in AWS S3 in Glacier mode (for cost optimization).
However, Hail matrix tables consist of a large number of tiny files (partitions), which are costly to query over an S3 bucket because every API call is billed (especially in Glacier mode). If we keep the data in standard storage, the data sitting there costs a lot.
Questions:
- Is there a recommended archival policy for storing large Hail matrix tables in an AWS S3 bucket?
- Should we zip the whole matrix table into a single file and upload it to the bucket?
- Is there any other, more compressed form of a Hail matrix table?
Let me know your thoughts.
In general, you can make each partition bigger by choosing fewer partitions with repartition or naive_coalesce.
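For example, here is a minimal sketch of coalescing an existing matrix table into fewer, larger partitions before archiving it (the paths and the target of 500 partitions are placeholders, not recommendations):

import hail as hl

mt = hl.read_matrix_table('s3://bucket/dataset.mt')
print(mt.n_partitions())  # check how many partitions the table currently has

# naive_coalesce merges adjacent partitions without a shuffle;
# repartition(n) shuffles, but produces more evenly sized partitions
mt = mt.naive_coalesce(500)
mt.write('s3://bucket/dataset-coalesced.mt', overwrite=True)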
I strongly recommend against writing out the entry data (aka the entire matrix table) multiple times. For example, if you’ve computed a bunch of row fields, you can save just those row fields:
mt.rows().write('s3://bucket/row-metadata.ht')
And then when you need to use those row fields again, you can join them back on:
mt = hl.import_bgen(...)
ht = hl.read_table('s3://bucket/row-metadata.ht')
mt = mt.annotate_rows(**ht[mt.row_key])
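If you only want to keep the newly computed fields rather than every row field, you can select them before writing. This is a sketch with hypothetical field names; select() keeps the row key automatically:

rows_ht = mt.rows().select('qc_pass', 'gnomad_af')  # hypothetical field names
rows_ht.write('s3://bucket/row-metadata.ht', overwrite=True)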
You can also use functions like semi_join_rows to keep only the rows that are present in another table.
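For example, a sketch of filtering a matrix table down to the rows whose keys appear in a saved table (paths are placeholders):

mt = hl.read_matrix_table('s3://bucket/dataset.mt')
ht = hl.read_table('s3://bucket/row-metadata.ht')
mt = mt.semi_join_rows(ht)  # keep only rows whose key is present in ht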
The same principle applies to column metadata.
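The column-side analogue looks like this (placeholder paths):

mt.cols().write('s3://bucket/col-metadata.ht', overwrite=True)

# later, join the sample metadata back on the column key
cols_ht = hl.read_table('s3://bucket/col-metadata.ht')
mt = mt.annotate_cols(**cols_ht[mt.col_key])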
Hail Table and MatrixTable partitions are already compressed with LZ4.