Best archival policy for storing Hail matrix tables in an AWS S3 bucket

Hi,

Our team is currently working with the UKBB 200K and 450K WES releases. We have created many Hail matrix tables (from the initial, preprocessing, and analysis stages) and stored them in AWS S3 using the Glacier storage class (for cost optimization).

However, Hail matrix tables consist of a large number of tiny files (partitions), which are costly to query over an S3 bucket because every API call is charged, especially in Glacier mode. If we keep them in the Standard storage class, the data at rest costs a lot.

Question:

  1. Is there a recommended archival policy for storing large Hail matrix tables in an AWS S3 bucket?
  2. Should we zip the whole Hail matrix table into a single file and upload that to the S3 bucket?
  3. Are there other, more compressed forms of a Hail matrix table?

Let me know your thoughts.

In general, you can make each partition bigger by choosing fewer partitions with repartition or naive_coalesce.
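
For example, a minimal sketch (the path and target partition count are placeholders, not recommendations):

import hail as hl

mt = hl.read_matrix_table('s3://bucket/dataset.mt')
# naive_coalesce merges adjacent partitions without a shuffle;
# pick a target count that yields fewer, larger partition files
mt = mt.naive_coalesce(200)
mt.write('s3://bucket/dataset-coalesced.mt', overwrite=True)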


I strongly recommend against writing out the entry data (aka the entire matrix table) multiple times. For example, if you’ve computed a bunch of row fields, you can save just those row fields:

mt.rows().write('s3://bucket/row-metadata.ht')

And then when you need to use those row fields again, you can join them back on:

mt = hl.import_bgen(...)
ht = hl.read_table('s3://bucket/row-metadata.ht')
mt = mt.annotate_rows(**ht[mt.row_key])

You can also use functions like semi_join_rows to keep only the rows that are present in another table.
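
For example (a sketch, assuming ht is keyed by the same row key as mt):

mt = mt.semi_join_rows(ht)  # keep only the rows whose key appears in ht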

The same principle applies to column metadata.
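
A sketch of the same pattern for columns (the path is a placeholder):

mt.cols().write('s3://bucket/col-metadata.ht')

cols_ht = hl.read_table('s3://bucket/col-metadata.ht')
mt = mt.annotate_cols(**cols_ht[mt.col_key])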


Hail Table and Matrix Table partitions are already compressed using LZ4.
