I tried to run the PCA function on datasets of different sizes (`pca_eigenvalues, pca_scores, _ = hl.hwe_normalized_pca(vds.GT, k=2)`).
These are the results on a machine with 64 CPUs and 416 GB of memory:
- 77,000 variants: 1 min 46 s (used 4% of the CPU)
- 2M variants: 8 min 50 s (used 30% of the CPU)
- 6M variants: more than 2 hours (used 30% of the CPU)
I didn’t run any other job at the same time. Why doesn’t this function use all the CPUs?
My dataset has around 230M variants and 397 partitions. I filtered the dataset to estimate the runtime of PCA on all the SNPs. I don’t know how many partitions each filtered dataset has. Is there a way to retrieve this information?
CPU usage may also be capped at 50% if the cores are virtual (on GCP, a 64-core machine has 32 physical hyperthreaded cores) and the underlying native BLAS is more efficient without hyperthreading. For example, most NumPy linear algebra routines with MKL top out at 50% CPU; see the top reply here: https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/737504
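As a quick sanity check of the logical-vs-physical core point, you can see how many logical CPUs Python reports (hyperthreads count double, so on a 64-vCPU GCP machine a BLAS running one thread per physical core shows roughly 50% utilization):

```python
import os

# os.cpu_count() reports *logical* CPUs; on a hyperthreaded machine this is
# twice the number of physical cores (e.g. 64 logical over 32 physical).
logical = os.cpu_count()
print(logical)
```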
Filtering doesn’t change the number of partitions; it just removes variants from each partition. To change the number of partitions, you’d use the `repartition` method. I’m curious, how many samples do you have?
Yes, you’ll likely find that PCA runs much faster if you repartition after filtering, e.g. `mt = mt.repartition(100)`. This does a shuffle to “consolidate” the filtered dataset, so there’s no need to scan past filtered records on each iteration of the PCA algorithm. I think PCA is taking so long because each iteration is still reading through all 230M variants even though it ignores nearly all of them.
Yes, each partition is a region of variants across all samples.
Thanks! I will try `mt.repartition`.
Do you think the size of the dataset will be an issue for running PCA on all the variants? I am doing these filters to estimate the runtime of PCA on the full dataset.
A direct answer to your first question (which, I now see, didn’t ask about timing):
It’s hard to say how the number of iterations before convergence depends on the number of variants, but apparently fewer iterations are required with more variants. I think more variants result in more nodes in the treeAggregate, which corresponds to more tasks per iteration.
Neither of these should depend on the number of partitions.
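The task-count intuition above can be sketched with a toy model of a tree aggregation (the branching factor of 32 is an illustrative assumption, not Spark's actual default, and the "leaves" here stand for blocks of variants rather than partitions):

```python
import math

def tree_aggregate_tasks(n_leaves, branching=32):
    """Toy model: number of tasks in a tree aggregation that repeatedly
    combines up to `branching` partial results until one remains.
    `branching=32` is an illustrative assumption, not Spark's default."""
    tasks = n_leaves          # one task per leaf block
    level = n_leaves
    while level > 1:
        level = math.ceil(level / branching)
        tasks += level        # one combine task per node at this level
    return tasks

# More leaf blocks (e.g. from more variants) means more tasks per iteration:
print(tree_aggregate_tasks(100))
print(tree_aggregate_tasks(1000))
```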