can you do one more with the “Jobs” tab?
and there are a bunch of other treeAggregate jobs below this one, right?
no, this is the only one.
it’s only been running for 1.6 minutes – you said it was taking hours?
Yes, hours. There are many completed jobs below, from the same pca command.
ah, that was my question above.
How many samples / variants do you have going into PCA?
~500 samples, and around 200K variants after pruning.
it also looks like you’re only using ~30-40 cores here, which could probably be bumped up significantly.
how many cores/memory/workers do you recommend?
You could use 200-400 cores without incurring much extra cost, I think.
The stage output looks like you’re spending a lot of time in GC – this is fixed in 0.2.
you mean 200-400 total cores across the driver + workers?
Yeah, I’d recommend bumping up to something like that.
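For reference, here's a minimal sketch of how you might size the Spark side to land in that 200-400 total core range. The property names are standard Spark settings, but the specific numbers (and how you pass them in, e.g. via spark-submit or your cluster manager) are assumptions you'd adjust for your own cluster:

```python
# Illustrative Spark sizing only; the executor counts/sizes here are
# assumptions, not the exact configuration used in this thread.
from pyspark import SparkConf

conf = (
    SparkConf()
    .set("spark.executor.instances", "50")  # 50 executors...
    .set("spark.executor.cores", "8")       # ...x 8 cores each = 400 total cores
    .set("spark.executor.memory", "16g")    # heap per executor
    .set("spark.driver.memory", "8g")
)
```

Favoring more moderately sized executors over a few very large ones also tends to keep GC pauses shorter, which is relevant given the GC time noted above.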
So how many jobs should I expect for the pca step? Is there a formula to calculate it? (I'm using 320 cores, the job count keeps increasing, and it's still running.)
There isn't a simple formula; the PCA step can require hundreds of iterative jobs. We use a Spark algorithm for distributed SVD.
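I can't speak to the exact code path inside pca, but as a sketch of what an iterative distributed SVD looks like on Spark, here's MLlib's RowMatrix.computeSVD used as a stand-in (assumption: this mirrors, rather than reproduces, what Hail runs). For a wide matrix like 500 samples x ~200K variants the solver runs iteratively, which is why there's no fixed job count – it depends on how quickly the top singular values converge.

```python
# Sketch only: MLlib's distributed SVD as a stand-in for the PCA internals.
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the normalized (samples x variants) genotype matrix.
rows = spark.sparkContext.parallelize([
    Vectors.dense([0.1, -0.2, 0.3, 0.0]),
    Vectors.dense([0.0, 0.4, -0.1, 0.2]),
    Vectors.dense([-0.3, 0.1, 0.0, 0.5]),
])
mat = RowMatrix(rows)

# With only 4 columns this toy case is solved locally, but with ~200K columns
# the solver switches to an iterative distributed mode: each iteration does a
# matrix-vector product against the row RDD, which shows up as one
# (tree)aggregate job in the Spark UI, and those jobs run one after another
# until the top-k singular values converge.
svd = mat.computeSVD(k=2, computeU=False)
print(svd.s)  # singular values
```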
Thanks. So each job corresponds to one step of the distributed SVD? And the jobs run sequentially, one after another?