Hi Hail team,
We recently tried to filter a 400K-sample VDS down to 10K samples and export it as a Hail MatrixTable. We hit problems at this scale in a Terra Jupyter notebook backed by a Spark cluster: with 400 primary workers plus 100 preemptible workers, the job still hadn't finished after 12 hours. We also tried running with only primary workers, and it still didn't go through. We ultimately did the conversion via a different route, so it isn't an immediate issue anymore, but we thought it might be helpful to share this experience for future development. We've used a similar process at smaller sample sizes without trouble, so we're not sure where the scaling bottleneck is. Thanks for any advice or future improvements.
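In case it's useful, this is roughly the pipeline we ran (the paths and the sample-list file below are placeholders, not our real ones):

```python
import hail as hl

hl.init()  # Spark-backed Hail on the Terra cluster

# Placeholder paths -- substitute real bucket locations
vds = hl.vds.read_vds('gs://my-bucket/dataset.vds')
samples_to_keep = hl.import_table('gs://my-bucket/keep_10k_samples.tsv', key='s')

# Subset the 400K-sample VDS down to the ~10K samples of interest
vds = hl.vds.filter_samples(vds, samples_to_keep, keep=True,
                            remove_dead_alleles=True)

# Densify to an ordinary MatrixTable and write it out
mt = hl.vds.to_dense_mt(vds)
mt.write('gs://my-bucket/subset_10k.mt', overwrite=True)
```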
What size of workers did you use? I'm also trying to understand Hail scaling and whether we should use larger workers or more of them.