How to speed up hail PCA analysis

trptyrphe · July 11, 2018, 7:03pm

Hi,

I am running the vds_pruned.pca() step on Databricks with 1 driver and 10 workers. The dataset comes from joint variant calling on 500 WES samples (so fairly large), and it has been running for hours. Is there a recommended cluster setting or method to speed it up? Thanks.

danking · July 11, 2018, 9:07pm

PCA is very sensitive to the number of partitions in your dataset. How many partitions does the dataset have before PCA is executed?
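
A minimal sketch of how the partition count could be checked and adjusted before PCA. It assumes the Hail 0.1 VariantDataset methods num_partitions() and repartition(); the helper name ensure_partitions and the min_partitions target are hypothetical, and the thread does not say what a good target is for this dataset.

```python
def ensure_partitions(vds, min_partitions):
    """Return vds unchanged if it has enough partitions, else a repartitioned dataset.

    Assumes `vds` exposes num_partitions() and repartition(n), as the Hail 0.1
    VariantDataset does. `min_partitions` is a hypothetical target chosen by the caller.
    """
    current = vds.num_partitions()
    print('dataset has %d partitions' % current)
    if current >= min_partitions:
        return vds
    # With fewer partitions than cores, some cores get no task at all.
    return vds.repartition(min_partitions)
```

It could be called as vds_pruned = ensure_partitions(vds_pruned, 160) right before the pca() call, where 160 is only a placeholder value.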

tpoterba · July 11, 2018, 9:10pm

Are you sure it’s PCA and not some upstream step that’s slow? LD prune has been vastly improved in 0.2, for instance.

trptyrphe · July 11, 2018, 10:15pm

Yes, I am sure it’s PCA because I used the pruned object as input for this function. And it seems that increasing the number of workers doesn’t help.

trptyrphe · July 11, 2018, 10:16pm

vds_pruned = vds.ld_prune(memory_per_core=1024, num_cores=128)
vds_pca = vds_pruned.pca('sa.scores', k=5)

tpoterba · July 11, 2018, 10:39pm

Can we see the full pipeline including upstream of ld prune? This is all getting executed at the same time.

trptyrphe · July 12, 2018, 11:47am

vds = vds.split_multi()
vds_qc = vds.variant_qc()
vds_sample_qc = vds_qc.sample_qc()
vds_common = vds_qc.filter_variants_expr('(v.contig == "X") || (v.contig == "Y") || (va.qc.AF < 0.05) || (va.qc.AF > 0.95)', keep=False)
vds_common_gq = vds_common.filter_genotypes('(v.altAllele.isSNP() && g.gq < 20) || (v.altAllele.isIndel() && g.gq < 60)', keep=False)
vds_pruned = vds_common_gq.ld_prune(memory_per_core=1024, num_cores=128)
vds_pruned = vds_pruned.cache()

tpoterba · July 12, 2018, 12:06pm

From this pipeline, I’m guessing the problem is LD prune rather than PCA.

trptyrphe · July 12, 2018, 12:11pm

Do you mean the vds_pca = vds_pruned.pca('sa.scores', k=5) step will redo LD pruning? I ran these steps as sequential cells in a Databricks notebook, and only the PCA step gets stuck.

tpoterba · July 12, 2018, 12:36pm

Hail follows a lazy execution model (similar to Spark), and there’s no guarantee that an operation is actually executed on the line of code where it’s declared. In this case, everything after the sample_qc is executed when you call PCA.
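
A toy illustration of lazy execution in plain Python. This is not Hail's or Spark's implementation; the LazyDataset class and the step names are made up for the sketch. Declaring a step only records it, and all recorded steps run when an action (here collect) is finally called.

```python
class LazyDataset(object):
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Hail code)."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def transform(self, name, fn):
        # Declaring a step only records it; nothing is computed here.
        return LazyDataset(self.source, self.steps + ((name, fn),))

    def collect(self, log):
        # The action: every recorded step runs now, in declaration order.
        data = list(self.source)
        for name, fn in self.steps:
            log.append(name)
            data = [fn(x) for x in data]
        return data


log = []
ds = LazyDataset([1, 2, 3])
doubled = ds.transform('double', lambda x: x * 2)
shifted = doubled.transform('add_one', lambda x: x + 1)
print(log)                   # [] : no step has run yet
result = shifted.collect(log)
print(result)                # [3, 5, 7]
print(log)                   # ['double', 'add_one'] : both ran inside collect
```

In the same way, a notebook cell that declares a filter or prune step can return quickly, and its cost shows up later in whichever cell triggers the computation.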

trptyrphe · July 12, 2018, 12:40pm

What if I save vds_pruned to disk and restart my job by just loading vds_pruned and calling PCA?

tpoterba · July 12, 2018, 12:52pm

That would be a good idea. However, if pruning is the problem and not PCA, that won’t solve it!
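
A rough sketch of that save-and-reload step, which also times the write so the cost of the upstream pipeline can be separated from the cost of PCA. It assumes the Hail 0.1 calls VariantDataset.write(output, overwrite=...) and HailContext.read(path); the helper name checkpoint and the example path are hypothetical.

```python
import time


def checkpoint(hc, vds, path):
    """Write vds to `path`, read it back, and report how long the write took.

    Assumes Hail 0.1's VariantDataset.write and HailContext.read.
    `path` is a hypothetical output location supplied by the caller.
    """
    start = time.time()
    # Writing forces every upstream step (QC, filters, LD prune) to run.
    vds.write(path, overwrite=True)
    elapsed = time.time() - start
    # The dataset read back from disk no longer depends on the upstream steps.
    return hc.read(path), elapsed


# Example use (the path is a placeholder):
# vds_pruned, seconds = checkpoint(hc, vds_pruned, '/tmp/vds_pruned.vds')
# vds_pca = vds_pruned.pca('sa.scores', k=5)
```

If the write accounts for most of the wall-clock time, the slow part is upstream of PCA; if the write is quick and pca() on the reloaded dataset is still slow, PCA itself is the bottleneck.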

trptyrphe · July 12, 2018, 1:01pm

The thing is, I find that increasing the number of workers greatly speeds up the pruning step but not the PCA step. Is it something to do with the linear algebra not being optimized for parallel computing?

tpoterba · July 12, 2018, 1:32pm

Can you show us the Spark WebUI output? How many partitions does your dataset have?

trptyrphe · July 12, 2018, 1:59pm

This?

DATABRICKS_STDOUT_END-dfebc7df-a01f-425f-b20c-a1e64f952a3e-1531397313966
Running on Apache Spark version 2.1.1
SparkUI available at http://10.192.154.117:44658
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.1-20613ed
2018-07-12 12:13:13 Hail: INFO: Running PCA with 5 components…
2018-07-12T12:13:41.704+0000: [GC (Allocation Failure) [PSYoungGen: 11111158K->1735583K(13024256K)] 11329091K->1953532K(42793984K), 0.8531040 secs] [Times: user=5.01 sys=1.55, real=0.85 secs]
2018-07-12T12:14:32.774+0000: [GC (Allocation Failure) [PSYoungGen: 12878841K->992740K(13118976K)] 13096790K->1583532K(42888704K), 0.6273351 secs] [Times: user=3.22 sys=1.71, real=0.63 secs]
2018-07-12T12:26:02.515+0000: [GC (Allocation Failure) [PSYoungGen: 13118948K->1379302K(13330432K)] 13709740K->2299045K(43100160K), 0.5131110 secs] [Times: user=2.97 sys=1.07, real=0.52 secs]
2018-07-12T12:33:00.622+0000: [GC (Allocation Failure) [PSYoungGen: 13169855K->1379309K(12251136K)] 14089598K->2383776K(42020864K), 0.3021412 secs] [Times: user=2.04 sys=0.30, real=0.30 secs]
2018-07-12T12:38:12.533+0000: [GC (System.gc()) [PSYoungGen: 10698875K->1177355K(12049408K)] 11703342K->2181831K(41819136K), 0.1900331 secs] [Times: user=1.45 sys=0.00, real=0.19 secs]
2018-07-12T12:38:12.723+0000: [Full GC (System.gc()) [PSYoungGen: 1177355K->0K(12049408K)] [ParOldGen: 1004475K->1770245K(29769728K)] 2181831K->1770245K(41819136K), [Metaspace: 147201K->147049K(149504K)], 2.4546218 secs] [Times: user=15.66 sys=0.24, real=2.45 secs]
2018-07-12T12:44:10.443+0000: [GC (Allocation Failure) [PSYoungGen: 10871808K->293021K(12885504K)] 12642053K->2063274K(42655232K), 0.0385196 secs] [Times: user=0.28 sys=0.00, real=0.04 secs]
2018-07-12T12:50:19.660+0000: [GC (Allocation Failure) [PSYoungGen: 11051947K->435345K(12818944K)] 12822200K->2205607K(42588672K), 0.0395039 secs] [Times: user=0.29 sys=0.00, real=0.03 secs]
2018-07-12T12:56:54.061+0000: [GC (Allocation Failure) [PSYoungGen: 11254929K->750888K(13066240K)] 13025191K->2521157K(42835968K), 0.1679928 secs] [Times: user=1.32 sys=0.00, real=0.17 secs]
2018-07-12T13:04:10.441+0000: [GC (Allocation Failure) [PSYoungGen: 11913512K->444816K(12981248K)] 13683781K->2215093K(42750976K), 0.0347225 secs] [Times: user=0.23 sys=0.00, real=0.03 secs]
2018-07-12T13:08:12.533+0000: [GC (System.gc()) [PSYoungGen: 6145115K->300172K(13202944K)] 7915392K->2070457K(42972672K), 0.0306805 secs] [Times: user=0.19 sys=0.00, real=0.03 secs]
2018-07-12T13:08:12.563+0000: [Full GC (System.gc()) [PSYoungGen: 300172K->0K(13202944K)] [ParOldGen: 1770285K->1916767K(29769728K)] 2070457K->1916767K(42972672K), [Metaspace: 147560K->147560K(149504K)], 2.3451885 secs] [Times: user=17.45 sys=0.13, real=2.35 secs]
2018-07-12T13:16:51.474+0000: [GC (Allocation Failure) [PSYoungGen: 11466240K->307578K(13148160K)] 13383007K->2224353K(42917888K), 0.0254219 secs] [Times: user=0.15 sys=0.00, real=0.02 secs]
2018-07-12T13:23:40.846+0000: [GC (Allocation Failure) [PSYoungGen: 11763994K->462330K(13349888K)] 13680769K->2379113K(43119616K), 0.0555873 secs] [Times: user=0.37 sys=0.01, real=0.06 secs]
2018-07-12T13:30:36.121+0000: [GC (Allocation Failure) [PSYoungGen: 12009585K->626858K(13262336K)] 13926368K->2543649K(43032064K), 0.0945164 secs] [Times: user=0.71 sys=0.00, real=0.09 secs]
2018-07-12T13:37:34.970+0000: [GC (Allocation Failure) [PSYoungGen: 12354218K->327418K(13471232K)] 14271009K->2244217K(43240960K), 0.0279697 secs] [Times: user=0.22 sys=0.00, real=0.02 secs]
2018-07-12T13:38:12.533+0000: [GC (System.gc()) [PSYoungGen: 2813897K->155622K(13421568K)] 4730696K->2072429K(43191296K), 0.0206759 secs] [Times: user=0.15 sys=0.00, real=0.02 secs]
2018-07-12T13:38:12.553+0000: [Full GC (System.gc()) [PSYoungGen: 155622K->0K(13421568K)] [ParOldGen: 1916807K->1776305K(29769728K)] 2072429K->1776305K(43191296K), [Metaspace: 147831K->147831K(151552K)], 1.9034749 secs] [Times: user=14.15 sys=0.03, real=1.90 secs]
2018-07-12T13:47:35.136+0000: [GC (Allocation Failure) [PSYoungGen: 12007936K->405128K(13565952K)] 13784241K->2181441K(43335680K), 0.0497494 secs] [Times: user=0.33 sys=0.00, real=0.05 secs]
2018-07-12T13:54:39.228+0000: [GC (Allocation Failure) [PSYoungGen: 12454079K->591346K(13493248K)] 14230392K->2367667K(43262976K), 0.0931541 secs] [Times: user=0.61 sys=0.00, real=0.10 secs]

tpoterba · July 12, 2018, 2:01pm

No, that’s the console output. What happens if you go to the IP we print there? I’m not sure how to access the webui on databricks.

trptyrphe · July 12, 2018, 2:11pm

The link doesn’t work. Which information should I pull? I can see the Spark UI, and there are several tabs like Stages, Storage, Environment, Executors, etc.

trptyrphe · July 12, 2018, 2:19pm

tpoterba · July 12, 2018, 2:22pm

A screenshot of “Stages” and another of the running stage while doing PCA would be very useful, thanks!

trptyrphe · July 12, 2018, 2:27pm

active stage
