Pca: CPU monitoring


I tried to run the pca function on different size of dataset (“pca_eigenvalues, pca_scores, _ = hl.hwe_normalized_pca(vds.GT, k = 2)”).
These are the results with a machine with 64 CPU and 416 GB of memory:

  • for 77 000 variants: 1min46s (used 4% of CPU)
  • for 2M variants: 8min50s (used 30% of CPU)
  • for 6M variants: more than 2 hours (used 30% of CPU)

I didn’t run any other job at the same time. Why does not this function use all the CPU?



how many partitions does the dataset have?

Hail’s PCA uses an algorithm inside Spark’s MLLib that is extremely sensitive to partitioning. You should have 100 partitions or fewer in this case.

Spark has a lot of overhead, which is why you’re not seeing high CPU utilization for this iterative algorithm.

My dataset has around 230M variants and 397 partitions. I filtered the dataset to estimate the time of the pca with all the SNP. I don’t know how many partitions for each filter. Is there a way to retrieve this information?

CPU usage may also be limited at 50% if the cores are virtual (on GCP, a 64 core machine has 32 physical hyperthreaded cores) and the underlying BLAS natives are more efficient without hyperthreading. For example, most NumPy linear algebra routines with MKL top out at 50% CPU, see the best reply here:

Filtering doesn’t change the number of partitions, it just removes variants from each partition. To change the number of partitions, you’d use the repartition command. I’m curious, how many samples do you have?

around 2000 samples

This PCA function runs 3 spark jobs and the job number 3 (treeAggregate at RowMatrix.scala:93) runs several tasks several times. For example:

  • for 77 000 variants, 1 task x 112 times,
  • for 2M variants, 8 tasks x 61 times,
  • for 6M variants, 15 tasks x 31 times
  1. Does this depend on the partitions?

  2. How are the partitions working? Is it several files with all the samples (columns) but only some variants (rows)?

  1. Yes, you’ll likely find that PCA runs much faster if you repartition after filtering, e.g. mt = mt.repartition(100). This does a shuffle to “consolidate” the filtered dataset, so there’s no need to scan past filtered records on each iteration of the PCA algorithm. I think PCA is taking so long because each iteration is still reading through all 230M variants even though it ignores nearly all of them.
  2. Yes, each partition is a region of variants across all samples.

Thanks! I will try mt.repartition.
Do you think that the size of the dataset should be an issue to run the PCA on all the variants? I am doing these filter to estimate the time of the PCA on all the dataset.

Direct answer to your first question (which I now see didn’t ask about timing).

It’s hard to say how the number of iterations before convergence depends on the number of variants, but apparently fewer iterations are required with more variants. I think more variants results in more nodes in the treeAggregate, which corresponds to more tasks per iteration.
Neither of these should depend on the number of partitions.

Curious, why would you want to run PCA on the whole dataset? And how many PCs would you like?