How to analyze Hail job logs using Spark UI after terminating a cluster

Hello,

I am running the Hail GVCF combiner from a Jupyter notebook on Dataproc.

While the GVCF combiner is running, I can monitor the job progress in the Spark UI.
I am wondering if I can see the job log in the same way after terminating the cluster on which Hail was run.
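For context, this is roughly how I start the cluster and open the notebook and Spark UI (just a sketch; the cluster name `my-cluster` is a placeholder):

```bash
# Start a Dataproc cluster with Hail installed (cluster name is a placeholder).
hailctl dataproc start my-cluster

# Open the Jupyter notebook where the combiner runs.
hailctl dataproc connect my-cluster notebook

# Open the Spark UI to monitor job progress while the cluster is up.
hailctl dataproc connect my-cluster spark-ui
```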

Thank you.


For jobs like this, I’d recommend using hailctl dataproc submit (which wraps gcloud dataproc submit), since Jupyter notebooks have problems with reconnecting to see execution status.
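For example, submitting a combiner script looks roughly like this (a sketch; the cluster name and script path are placeholders):

```bash
# Submit a Python script that runs the combiner as a Dataproc job
# (cluster name and script file are placeholders).
hailctl dataproc submit my-cluster combine_gvcfs.py
```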

Thank you for your reply. I will try `submit` instead of `notebook`. If I use `submit`, can I still see the job log in the Spark UI after terminating the cluster? What I want is to analyze the executors’ info after the run is done and the cluster is stopped. I would like to know whether that is possible.

No, that’s not possible, since the Spark UI is a web server run by the Spark driver machine. Once that machine is no longer running, the UI is dead.

Got it, thanks. Then how can I check the job results and task information after the machine is gone?

You cannot. At least the driver machine must stay alive. What do you hope to learn from the Spark task information?

Thank you for your answer. I am trying to do joint calling for 1000 GVCFs using the Hail GVCF combiner. If it runs successfully under our conditions, we may extend the number of GVCFs. My first step is to find the Hail function, Spark, and Dataproc parameters that optimize the runtime and cost of the GVCF combiner on a small GVCF set. I would like to keep track of all the job logs as those parameters change. While I analyze the results, I want to stop the cluster to save cost, because I only need to see the completed job results. I hope that answers your question. Any advice related to this work would be welcome. Thank you.

You can use `hailctl dataproc modify --num-preemptible-workers 0` to shrink the cluster. The minimal cluster is 2 non-preemptible (regular) workers and one leader node. That should cost very little per hour and give you plenty of time to analyze the logs.
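For example (cluster name is a placeholder):

```bash
# Remove all preemptible workers, shrinking the cluster to its minimum of
# one leader node and 2 regular workers (cluster name is a placeholder).
hailctl dataproc modify my-cluster --num-preemptible-workers 0
```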

I doubt you’ll find much useful information in the worker logs.

Maybe @tpoterba or @chrisvittal can provide some information on recommended worker configurations.

I think the right model here is to use autoscaling, so you can inspect this stuff while paying only for the driver machine (less than $1.00 / hr).

There are some instructions on autoscaling here: https://hail.is/docs/0.2/experimental/vcf_combiner.html

and here:
broad.io/hail-tips-and-tricks-1
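Roughly, attaching a Dataproc autoscaling policy looks like this (a sketch assuming you already have a policy YAML file; the policy ID, file name, cluster name, and region are placeholders; see the links above for the full walkthrough):

```bash
# Import an autoscaling policy defined in a local YAML file
# (policy ID, file name, and region are placeholders).
gcloud dataproc autoscaling-policies import combiner-policy \
    --source=combiner-policy.yaml --region=us-central1

# Attach the policy to the cluster so workers scale down when the cluster is idle.
gcloud dataproc clusters update my-cluster \
    --autoscaling-policy=combiner-policy --region=us-central1
```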

Thanks to both @danking and @tpoterba. I will definitely look into your references. I am sure they will be very helpful for my work.