Progress output when running on Jupyter notebook

I have a question about the progress output I see when running a Hail query in a Jupyter notebook. The image is attached.
What does “stage” mean here? I do not see this information when I run the same code on Google Colab.

Many thanks!

That’s an Apache Spark progress bar. Hail typically uses Spark as its distributed backend, and this progress bar gives you some sense of what’s going on behind the scenes. It’s not easy to map parts of a Hail query onto individual Spark stages, though.


Thanks Tim. Also, is there any way to configure Hail to use a GPU to speed up execution?

Hail doesn’t have a GPU backend right now, no. What you can do, however, is use a cluster of CPUs to scale up to bigger datasets. Generating code for GPUs isn’t a super high priority for us until we’re hitting the limits of horizontal scalability on CPUs.


Thanks. Very useful.

Hi Tim, I am running Hail queries on an M1 Max MacBook. I noticed that it is more than 6 times faster than Google Colab (no GPU). To my understanding, the M1 Max has 10 CPU cores and 32 GPU cores. Can it be considered a CPU cluster, which would explain why Hail runs faster on it than on Colab?

Yes, by default you’ll be running Spark with the configuration local[*], which will use all available CPUs.

I’d recommend also configuring memory as here: Java Heap Space out of memory - #6 by danking

The default Spark memory settings might be throttling the computation a bit.
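
For anyone landing here later, this is a rough sketch of what raising the memory limit can look like in a notebook. It assumes the usual PySpark mechanism of setting the PYSPARK_SUBMIT_ARGS environment variable before the JVM starts. The helper names (pyspark_submit_args, init_hail_local) and the 8g value are made up for illustration, so pick a value that fits your machine and check the linked thread for the recommended settings.

```python
import os


def pyspark_submit_args(driver_memory: str = "8g") -> str:
    """Build a PYSPARK_SUBMIT_ARGS value that raises the Spark driver heap.

    Hypothetical helper: in local mode the work runs inside the driver JVM,
    so driver memory is the setting that matters here.
    """
    return f"--driver-memory {driver_memory} pyspark-shell"


def init_hail_local(driver_memory: str = "8g"):
    """Start Hail on all local CPU cores with a larger JVM heap (illustrative)."""
    # The variable has to be set before Hail starts Spark, i.e. before the
    # JVM is launched; changing it afterwards has no effect on that session.
    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args(driver_memory)

    import hail as hl  # imported lazily so the environment variable is already set

    # local[*] is the default mentioned above; it is spelled out for clarity.
    hl.init(master="local[*]")
    return hl
```

Calling `hl = init_hail_local("8g")` in the first cell of a fresh kernel should then start Hail with the larger heap. If Hail has already been initialized in that kernel, restart the kernel first.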
