Pca: CPU monitoring

Laura · August 15, 2018, 6:45pm

Hi,

I tried to run the pca function on different size of dataset (“pca_eigenvalues, pca_scores, _ = hl.hwe_normalized_pca(vds.GT, k = 2)”).
These are the results with a machine with 64 CPU and 416 GB of memory:

for 77 000 variants: 1min46s (used 4% of CPU)
for 2M variants: 8min50s (used 30% of CPU)
for 6M variants: more than 2 hours (used 30% of CPU)

I didn’t run any other job at the same time. Why does not this function use all the CPU?

Thanks,

Laura

tpoterba · August 15, 2018, 6:57pm

how many partitions does the dataset have?

Hail’s PCA uses an algorithm inside Spark’s MLLib that is extremely sensitive to partitioning. You should have 100 partitions or fewer in this case.

Spark has a lot of overhead, which is why you’re not seeing high CPU utilization for this iterative algorithm.

Laura · August 15, 2018, 7:10pm

My dataset has around 230M variants and 397 partitions. I filtered the dataset to estimate the time of the pca with all the SNP. I don’t know how many partitions for each filter. Is there a way to retrieve this information?

jbloom · August 15, 2018, 7:11pm

CPU usage may also be limited at 50% if the cores are virtual (on GCP, a 64 core machine has 32 physical hyperthreaded cores) and the underlying BLAS natives are more efficient without hyperthreading. For example, most NumPy linear algebra routines with MKL top out at 50% CPU, see the best reply here:
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/737504

jbloom · August 15, 2018, 7:16pm

Filtering doesn’t change the number of partitions, it just removes variants from each partition. To change the number of partitions, you’d use the repartition command. I’m curious, how many samples do you have?

Laura · August 15, 2018, 7:20pm

around 2000 samples

Laura · August 15, 2018, 7:37pm

This PCA function runs 3 spark jobs and the job number 3 (treeAggregate at RowMatrix.scala:93) runs several tasks several times. For example:

for 77 000 variants, 1 task x 112 times,
for 2M variants, 8 tasks x 61 times,
for 6M variants, 15 tasks x 31 times

Does this depend on the partitions?
How are the partitions working? Is it several files with all the samples (columns) but only some variants (rows)?

jbloom · August 15, 2018, 7:49pm

Yes, you’ll likely find that PCA runs much faster if you repartition after filtering, e.g. mt = mt.repartition(100). This does a shuffle to “consolidate” the filtered dataset, so there’s no need to scan past filtered records on each iteration of the PCA algorithm. I think PCA is taking so long because each iteration is still reading through all 230M variants even though it ignores nearly all of them.
Yes, each partition is a region of variants across all samples.

Laura · August 15, 2018, 8:01pm

Thanks! I will try mt.repartition.
Do you think that the size of the dataset should be an issue to run the PCA on all the variants? I am doing these filter to estimate the time of the PCA on all the dataset.

jbloom · August 15, 2018, 8:02pm

Direct answer to your first question (which I now see didn’t ask about timing).

It’s hard to say how the number of iterations before convergence depends on the number of variants, but apparently fewer iterations are required with more variants. I think more variants results in more nodes in the treeAggregate, which corresponds to more tasks per iteration.
Neither of these should depend on the number of partitions.

jbloom · August 15, 2018, 8:03pm

Curious, why would you want to run PCA on the whole dataset? And how many PCs would you like?

Topic		Replies	Views
How to speed up hail PCA analysis Hail Query & hailctl	35	1290	July 12, 2018
PCA job aborted from SparkException Hail Query & hailctl	46	2668	July 28, 2020
Questions about optimizing Hail and Spark configs and estimating resources and runtimes Hail Query & hailctl	3	1159	December 1, 2022
Running Hail on Terra -- how should I optimize? Hail Query & hailctl	7	1240	February 3, 2021
Error summary: OutOfMemoryError: Java heap space Hail Query & hailctl	15	2596	August 18, 2022

Pca: CPU monitoring

Related topics