How to speed up hail PCA analysis

tpoterba · July 12, 2018, 2:29pm

Can you do one more with “jobs”?

trptyrphe · July 12, 2018, 2:29pm

tpoterba · July 12, 2018, 2:31pm

and there are a bunch of other treeAggregate jobs below this one, right?

trptyrphe · July 12, 2018, 2:36pm

No, this is the only one.

tpoterba · July 12, 2018, 2:36pm

It’s only been running for 1.6 minutes. You said it was taking hours?

trptyrphe · July 12, 2018, 2:41pm

Yes, hours. There are many completed jobs below, all from the same PCA command.

tpoterba · July 12, 2018, 2:43pm

Ah, that was my question above.

How many samples / variants do you have going into PCA?

trptyrphe · July 12, 2018, 2:44pm

About 500 samples, and around 200K variants after pruning.

tpoterba · July 12, 2018, 2:44pm

It also looks like you’re only using ~30-40 cores here, which could probably be bumped up significantly.

trptyrphe · July 12, 2018, 2:46pm

How many cores and workers, and how much memory, do you recommend?

tpoterba · July 12, 2018, 2:52pm

You could use 200-400 without incurring much extra cost, I think.

The stage output suggests you’re spending a lot of time in GC (garbage collection). This is fixed in 0.2.

trptyrphe · July 12, 2018, 5:11pm

You mean 200-400 total cores across the driver and workers?

tpoterba · July 12, 2018, 5:13pm

Yeah, I’d recommend bumping up to something like that.

trptyrphe · July 12, 2018, 5:37pm

So how many jobs should I expect for the PCA step? Is there a formula to calculate it? (I am using 320 cores, the job count keeps increasing, and it’s still running.)

tpoterba · July 12, 2018, 5:51pm

The PCA step can require hundreds of iterative jobs. We use a Spark algorithm for distributed SVD.
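
To illustrate why an iterative SVD turns into many jobs, here is a minimal single-machine sketch. It is not Hail's or Spark's actual code: it uses plain power iteration as a stand-in for the real solver, and the names `gramian_multiply` and `top_singular_value` are hypothetical. The point is only that each iteration needs one full pass over the data (which on a cluster would be one distributed aggregation job), and that each pass depends on the vector produced by the previous one.

```python
import math


def gramian_multiply(rows, v):
    """Compute (A^T A) v in a single pass over the rows of A.

    On a cluster, this pass would correspond to one distributed
    aggregation job; here it is just a local loop.
    """
    out = [0.0] * len(v)
    for row in rows:
        dot = sum(a * b for a, b in zip(row, v))  # (A v) entry for this row
        for j, a in enumerate(row):
            out[j] += a * dot                     # accumulate A^T (A v)
    return out


def top_singular_value(rows, tol=1e-12, max_iter=1000):
    """Power iteration on A^T A. Returns (largest singular value, passes)."""
    n = len(rows[0])
    v = [1.0 / math.sqrt(n)] * n  # unit-length starting vector
    eig = 0.0
    for passes in range(1, max_iter + 1):
        w = gramian_multiply(rows, v)                # one "job" per iteration
        new_eig = math.sqrt(sum(x * x for x in w))   # estimate of top eigenvalue
        v = [x / new_eig for x in w]                 # input to the next pass
        if abs(new_eig - eig) <= tol * new_eig:
            return math.sqrt(new_eig), passes
        eig = new_eig
    return math.sqrt(eig), max_iter


# Toy matrix with singular values 3 and 1.
sigma, passes = top_singular_value([[3.0, 0.0], [0.0, 1.0]])
print(sigma, passes)
```

In this sketch the number of passes depends on how quickly the iteration converges rather than on a fixed formula, and adding cores can shorten each pass but cannot run the passes in parallel with one another.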

trptyrphe · July 12, 2018, 5:57pm

Thanks. So each job represents one slice of the distributed SVD? And the jobs run sequentially, one after another?
