How to speed up Hail PCA analysis

can you share one more, this time of the “Jobs” tab?

and there are a bunch of other treeAggregate jobs below this one, right?

no, this is the only one.

it’s only been running for 1.6 minutes. You said it was taking hours?

Yes, hours. There are many completed jobs below, all from the same pca command.

ah, that was my question above.

How many samples / variants do you have going into PCA?

~500 samples, and around 200K variants after pruning.

it also looks like you’re only using ~30-40 cores here, which could probably be bumped up significantly.

how many cores and workers, and how much memory, do you recommend?

You could use 200-400 without incurring much extra cost, I think.

The stage output looks like you’re spending a lot of time in GC (garbage collection). This is fixed in 0.2.

you mean 200-400 total cores across the driver and workers?

Yeah, I’d recommend bumping up to something like that.
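
For anyone sizing this by hand, here is a rough sketch of how a total core budget could be turned into Spark executor settings. The helper name, the 8 cores per executor, and the 3 GB per core are assumptions for illustration only, not recommendations from this thread; the three property names are standard Spark configuration keys. On a managed cluster you would more likely just resize the worker pool.

```python
def spark_executor_settings(total_cores, cores_per_executor=8, memory_per_core_gb=3):
    """Hypothetical helper: split a total core budget into Spark executor properties.

    cores_per_executor and memory_per_core_gb are illustrative assumptions.
    """
    if total_cores < cores_per_executor:
        raise ValueError("total_cores must be at least cores_per_executor")
    # Whole executors only; leftover cores are simply not used.
    executors = total_cores // cores_per_executor
    return {
        "spark.executor.instances": str(executors),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{cores_per_executor * memory_per_core_gb}g",
    }


# 320 cores is the figure mentioned in this thread.
settings = spark_executor_settings(320)
for key, value in settings.items():
    print(f"--conf {key}={value}")
```

The printed lines are in the form that spark-submit accepts for its --conf option.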

So how many jobs should I expect for the PCA step? Is there a formula to calculate it? (I am using 320 cores, the job count keeps increasing, and it’s still running.)

The PCA step can require hundreds of iterative jobs. We use a Spark algorithm for distributed SVD.

Thanks. So each job represents one slice of the distributed SVD? And the jobs run sequentially, one after another?
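
A toy sketch of why an iterative SVD produces a chain of sequential jobs. This uses plain power iteration in pure Python as a simpler stand-in; it is not the algorithm Hail or Spark actually runs (as far as I understand, Spark's distributed SVD uses an ARPACK-style iterative eigensolver). The shared idea is that every iteration needs one full pass over the data to multiply the matrix by the current vector, and the next iteration cannot start until that pass finishes. The function names and the tiny matrix are made up for illustration.

```python
import math


def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def top_singular_value(A, tol=1e-12, max_passes=1000):
    """Power iteration on A^T A. Returns (sigma, number of passes over A)."""
    At = [list(col) for col in zip(*A)]
    n = len(A[0])
    v = [1.0 / math.sqrt(n)] * n  # deterministic unit start vector
    prev = 0.0
    lam = 0.0
    passes = 0
    while passes < max_passes:
        # One full pass over the data: w = A^T (A v).
        # In a distributed setting this step would be one cluster job,
        # and it depends on the v produced by the previous pass.
        w = matvec(At, matvec(A, v))
        passes += 1
        lam = math.sqrt(sum(x * x for x in w))
        if lam == 0.0:
            return 0.0, passes
        v = [x / lam for x in w]
        if abs(lam - prev) <= tol * lam:
            break
        prev = lam
    return math.sqrt(lam), passes


# Tiny example matrix whose singular values are 3 and 1.
A = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma, passes = top_singular_value(A)
print(f"top singular value ~ {sigma:.6f} after {passes} sequential passes")
```

The number of passes depends on how quickly the iteration converges rather than on a fixed formula, which is consistent with the job count climbing while the PCA runs.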