Hi,
I am wondering about the progress output that appeared when I ran a query using Hail in a Jupyter notebook. The image is attached.
What does the “stage” here mean? I do not see this information when I run the same code on Google Colab.
Many thanks!
That’s an Apache Spark progress bar. Hail typically uses Spark as its distributed backend, and each “stage” is a group of parallel tasks that Spark runs between shuffles of the data; the progress bar gives you some sense of what’s going on behind the scenes. It’s not easy to map parts of a Hail query onto individual Spark stages, though.
Thanks Tim. Also, is there any way to configure Hail’s execution so it can use the GPU and speed things up?
Hail doesn’t have a GPU backend right now, no. What you can do, however, is use a cluster of CPUs to scale up to bigger datasets. Generating code for GPUs isn’t a super high priority for us until we’re hitting the limits of horizontal scalability on CPUs.
Thanks. Very useful.
Hi Tim, I am running Hail queries on an M1 Max MacBook. I noticed that it is more than 6 times faster than Google Colab (no GPU). To my understanding, the M1 Max has 10 CPU cores and 32 GPU cores. Can it be considered a CPU cluster, and is that why Hail runs faster on it than on Colab?
Yes, by default you’ll be running Spark with the configuration local[*], which will use all available CPUs.
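If you want to double-check, here is a small sketch (assuming a plain local Hail session; hl.spark_context() exposes the underlying SparkContext):

```python
import hail as hl

hl.init()  # with no cluster configured, Hail starts Spark in local[*] mode

# The underlying SparkContext reports what local[*] resolved to on this machine.
sc = hl.spark_context()
print(sc.master)               # e.g. 'local[*]'
print(sc.defaultParallelism)   # number of CPU cores Spark will schedule tasks on
```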
I’d also recommend configuring memory as described here: Java Heap Space out of memory - #6 by danking
The default Spark memory settings might be throttling the computation a bit.
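For example, something like this (a sketch; the '16g' value is just an assumption, pick a value that fits your machine's RAM):

```python
import hail as hl

# Spark's default driver heap is fairly small, so a query that fits easily in
# the machine's RAM can still hit Java heap space errors or spend a lot of
# time in garbage collection. Raising spark.driver.memory at init time helps.
hl.init(spark_conf={'spark.driver.memory': '16g'})  # assumed value; tune for your RAM
```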