This should be the maximum memory used by Hail per Spark task. It only tracks memory managed directly by Hail, though, not any Java objects that might be created. Methods like linear_regression_rows currently create some large Java objects, and their memory usage will be undercounted here. We are working on this.
CPU use is also hard to estimate. Roughly, it should be the total job time multiplied by the number of cores, but if a pipeline does a lot of work on the driver node at some point, this will be an overestimate. You may be able to get more precise task start and stop times from Spark.
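To make that arithmetic concrete, here is a minimal sketch, assuming the whole pipeline runs between the two timing calls and that defaultParallelism is a reasonable proxy for the total core count (both are assumptions, and the estimate will be too high whenever work runs only on the driver):

```python
import time
import hail as hl

hl.init()
n_cores = hl.spark_context().defaultParallelism  # rough proxy for total cores

start = time.time()
# ... run the pipeline here ...
wall_seconds = time.time() - start

# Rough CPU estimate: wall-clock time multiplied by the number of cores.
cpu_core_seconds = wall_seconds * n_cores
print(f'rough CPU estimate: {cpu_core_seconds:.0f} core-seconds')
```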
I did look into the Hail logs, but I couldn't find any term like TaskReport.
These were the common log lines:
2021-07-08 01:02:57 root: INFO: Timer: Time taken for LowerMatrixToTable -- Verify : 0.090ms, total 23.506ms
2021-07-08 01:02:57 root: INFO: Timer: Time taken for LowerMatrixToTable : 3.748ms, total 23.566ms, tagged coverage 93.2
2021-07-08 01:02:57 root: INFO: Timer: Time taken for Optimize -- FoldConstants : 0.817ms, total 29.661ms
2021-07-08 01:02:57 root: INFO: Timer: Time taken for Optimize -- ExtractIntervalFilters : 0.283ms, total
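These Timer entries look like compiler-phase timings rather than per-task metrics, which may be why nothing like TaskReport shows up. For reference, a small sketch that aggregates them from a Hail log file; the file name and the exact line layout are assumptions based on the snippet above:

```python
import re
from collections import defaultdict

# Sum the "Time taken for <phase> : <ms>ms" Timer lines per compiler phase.
timer_re = re.compile(r'Time taken for (.+?) : ([\d.]+)ms')

totals = defaultdict(float)
with open('hail-20210708.log') as log:  # placeholder log file name
    for line in log:
        match = timer_re.search(line)
        if match:
            totals[match.group(1).strip()] += float(match.group(2))

for phase, ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f'{phase}: {ms:.3f} ms')
```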
So, the higher-level goal here is to find tasks that are taking more than a day and see if they can be parallelized further by scaling or redefining the pipeline.
I have been looking at the Spark logs, but I am not sure how to relate them to Hail tasks.
Should I check Spark jobs between specified times and get CPU and RAM usage across all executors?
In case there is no out-of-the-box solution, I will try to provide a rough estimate to my team. I can look at the Spark UI for the jobs or tasks that started and filter them by time brackets.
For example, computation done between 11:25 PM and 11:45 PM must be sample_qc only. However, this may be drastically incorrect if there are parallel tasks running.
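A rough sketch of that bracketing idea, assuming I wrap the actions that actually trigger Spark jobs (Hail is lazy, so only write, collect, export and similar force execution); the paths and labels are placeholders:

```python
import time
import hail as hl

def bracketed(label, action):
    # Record wall-clock start/end so Spark jobs visible in the UI between
    # these timestamps can be attributed to this step.
    start = time.strftime('%Y-%m-%d %H:%M:%S')
    result = action()
    end = time.strftime('%Y-%m-%d %H:%M:%S')
    print(f'{label}: {start} -> {end}')
    return result

mt = hl.read_matrix_table('gs://my-bucket/input.mt')  # placeholder path
mt = hl.sample_qc(mt)
bracketed('sample_qc + write', lambda: mt.write('gs://my-bucket/qc.mt'))
```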
This can be achieved in two ways:
1. Spark REST API - it exposes the values of the task metrics collected by Spark executors at the granularity of task execution (sketched after this list). This is not possible for us, since we do not have a REST API setup; our cluster is managed by a third party.
2. Selenium automation - I can scrape the Spark UI and extract the same metrics.
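For completeness, this is roughly what option 1 would look like if the monitoring REST API were reachable; the host, port, and field names are assumptions and depend on the Spark version:

```python
import requests

base = 'http://spark-driver-host:4040/api/v1'  # placeholder host and port

# Walk applications -> stages -> tasks and print per-task metrics.
app_id = requests.get(f'{base}/applications').json()[0]['id']
stages = requests.get(f'{base}/applications/{app_id}/stages').json()

for stage in stages:
    tasks = requests.get(
        f"{base}/applications/{app_id}/stages/"
        f"{stage['stageId']}/{stage['attemptId']}/taskList"
    ).json()
    for task in tasks:
        metrics = task.get('taskMetrics', {})
        print(stage['stageId'], task['taskId'],
              metrics.get('executorCpuTime'),
              metrics.get('peakExecutionMemory'))
```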
We don’t currently have any way to determine how long a particular operation (like sample_qc) takes. By the time execution begins, the context of which Python commands created the Spark jobs is lost.
This is hard to do because we have an optimizing compiler that fuses operations together. For example, if you do a sample_qc and then an annotate_cols, we will recognize that those are both column annotation operations and fuse them into one bulk column annotation, so it no longer makes sense to ask how long one or the other took, since they became a single operation.
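As a concrete illustration of that fusion, using a toy dataset (the dataset, threshold, and field name here are made up):

```python
import hail as hl

# Toy dataset just for illustration.
mt = hl.balding_nichols_model(n_populations=3, n_samples=100, n_variants=1000)

# Both of these are column annotations, so the compiler can fuse them into
# a single bulk column annotation when execution is finally triggered.
mt = hl.sample_qc(mt)
mt = mt.annotate_cols(keep=mt.sample_qc.call_rate > 0.95)

# Nothing has run yet; the fused annotation executes when an action forces it.
mt.cols().show(5)
```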