How to speed up Hail PCA analysis

Hi,

I am running the vds_pruned.pca() step on Databricks with 1 driver and 10 workers. The dataset comes from joint variant calling on 500 WES samples (so it is fairly large), and the step has been running for hours without finishing. Is there a recommended cluster setting or method to speed it up? Thanks.

PCA is very sensitive to the number of partitions in your dataset. How many partitions does the dataset have before PCA is executed?
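For readers following along, here is a minimal sketch of how one might check the partition count before PCA. It assumes the Hail 0.1 `VariantDataset` methods `num_partitions()` and `repartition()`; the helper name `ensure_partitions` and the `target` argument are hypothetical, and the thread does not say what a good partition count would be.

```python
def ensure_partitions(vds, target):
    """Report the partition count and repartition only if it differs from target.

    `vds` is assumed to be a Hail 0.1 VariantDataset; `target` is a
    hypothetical value the caller has to choose for their own cluster.
    """
    current = vds.num_partitions()
    print('partitions before PCA: %d' % current)
    if current != target:
        # repartition() returns a new dataset; the original is left unchanged.
        vds = vds.repartition(target)
    return vds
```

Calling it as `vds_pruned = ensure_partitions(vds_pruned, target)` just before `pca` would at least print the number the question above asks for.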

Are you sure it’s PCA and not some upstream step that’s slow? LD prune has been vastly improved in 0.2, for instance.

Yes, I am sure it’s PCA, because I used the pruned object as input to this function. Increasing the number of workers doesn’t seem to help either.

vds_pruned = vds.ld_prune(memory_per_core=1024, num_cores=128)
vds_pca = vds_pruned.pca('sa.scores', k=5)

Can we see the full pipeline including upstream of ld prune? This is all getting executed at the same time.

vds = vds.split_multi()
vds_qc = vds.variant_qc()
vds_sample_qc = vds_qc.sample_qc()
vds_common = vds_qc.filter_variants_expr('(v.contig == "X") || (v.contig == "Y") || (va.qc.AF < 0.05) || (va.qc.AF > 0.95)', keep=False)
vds_common_gq = vds_common.filter_genotypes('(v.altAllele.isSNP() && g.gq < 20) || (v.altAllele.isIndel() && g.gq < 60)', keep=False)
vds_pruned = vds_common_gq.ld_prune(memory_per_core=1024, num_cores=128)
vds_pruned = vds_pruned.cache()

From this pipeline, I’m guessing the problem is LD prune rather than PCA.

Do you mean the vds_pca = vds_pruned.pca('sa.scores', k=5) step will redo the LD pruning? I run these steps in sequential blocks of a Databricks notebook, and only the PCA step gets stuck.

Hail follows a lazy execution model (similar to Spark) and there’s no guarantee that an operation is actually executed in the same line of code it’s declared. In this case, everything after the sample_qc is executed when you call PCA.
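To make the lazy-execution point concrete, here is a toy sketch in plain Python. It is not Hail's or Spark's implementation; the `LazyPipeline` class and the step functions are made up for illustration. Declaring a step only records it, and the work happens when an action is finally called.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (not Hail's implementation)."""

    def __init__(self, steps=()):
        self.steps = tuple(steps)  # recorded (name, function) pairs, not yet run

    def transform(self, name, fn):
        # Declaring a step only records it; nothing is computed here.
        return LazyPipeline(self.steps + ((name, fn),))

    def run(self, data, log):
        # An "action" (PCA in this thread) triggers every recorded step.
        for name, fn in self.steps:
            log.append(name)
            data = fn(data)
        return data


executed = []
pipeline = (
    LazyPipeline()
    .transform("filter_variants", lambda xs: [x for x in xs if x >= 0.05])
    .transform("ld_prune", lambda xs: xs[::2])
)
assert executed == []  # nothing has run yet, even though both steps are declared

result = pipeline.run([0.01, 0.2, 0.3, 0.5], executed)
print(executed)  # ['filter_variants', 'ld_prune']
print(result)    # [0.2, 0.5]
```

In a notebook this is why the cell that appears to hang may not be the one doing the slow work: the time of every deferred step is charged to the cell that triggers the action.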

What if I save vds_pruned to disk and restart my job by just loading vds_pruned and calling PCA?

That would be a good idea. However, if pruning is the problem and not PCA, that won’t solve it!
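A rough sketch of that checkpointing idea, with timing added so the two stages can be told apart. It assumes the Hail 0.1 calls `VariantDataset.write(path)` and `HailContext.read(path)`, plus the `pca('sa.scores', k=5)` call from this thread; the helper name `checkpoint_then_pca` and any path passed to it are hypothetical.

```python
import time


def checkpoint_then_pca(hc, vds_pruned, path, k=5):
    """Write the pruned dataset, read it back, and run PCA on the reloaded copy.

    Returns (pca_result, timings), where timings holds wall-clock seconds
    for the write step and for the PCA step.
    """
    timings = {}

    # write() forces everything upstream (filters, LD prune) to execute now.
    start = time.time()
    vds_pruned.write(path)
    timings['write'] = time.time() - start

    # The reloaded dataset carries no deferred upstream work.
    reloaded = hc.read(path)

    # Time spent here is attributable to PCA on the stored data.
    start = time.time()
    result = reloaded.pca('sa.scores', k=k)
    timings['pca'] = time.time() - start

    return result, timings
```

If `timings['write']` dominates, the slow part is pruning or an earlier step; if `timings['pca']` dominates, the slow part really is PCA.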

The thing is, I find that increasing the number of workers greatly speeds up the pruning step but not the PCA step. Is that because the linear algebra is not optimized for parallel computing?

Can you show us the Spark web UI output? How many partitions does your dataset have?

This?

DATABRICKS_STDOUT_END-dfebc7df-a01f-425f-b20c-a1e64f952a3e-1531397313966
Running on Apache Spark version 2.1.1
SparkUI available at http://10.192.154.117:44658
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.1-20613ed
2018-07-12 12:13:13 Hail: INFO: Running PCA with 5 components…
2018-07-12T12:13:41.704+0000: [GC (Allocation Failure) [PSYoungGen: 11111158K->1735583K(13024256K)] 11329091K->1953532K(42793984K), 0.8531040 secs] [Times: user=5.01 sys=1.55, real=0.85 secs]
2018-07-12T12:14:32.774+0000: [GC (Allocation Failure) [PSYoungGen: 12878841K->992740K(13118976K)] 13096790K->1583532K(42888704K), 0.6273351 secs] [Times: user=3.22 sys=1.71, real=0.63 secs]
2018-07-12T12:26:02.515+0000: [GC (Allocation Failure) [PSYoungGen: 13118948K->1379302K(13330432K)] 13709740K->2299045K(43100160K), 0.5131110 secs] [Times: user=2.97 sys=1.07, real=0.52 secs]
2018-07-12T12:33:00.622+0000: [GC (Allocation Failure) [PSYoungGen: 13169855K->1379309K(12251136K)] 14089598K->2383776K(42020864K), 0.3021412 secs] [Times: user=2.04 sys=0.30, real=0.30 secs]
2018-07-12T12:38:12.533+0000: [GC (System.gc()) [PSYoungGen: 10698875K->1177355K(12049408K)] 11703342K->2181831K(41819136K), 0.1900331 secs] [Times: user=1.45 sys=0.00, real=0.19 secs]
2018-07-12T12:38:12.723+0000: [Full GC (System.gc()) [PSYoungGen: 1177355K->0K(12049408K)] [ParOldGen: 1004475K->1770245K(29769728K)] 2181831K->1770245K(41819136K), [Metaspace: 147201K->147049K(149504K)], 2.4546218 secs] [Times: user=15.66 sys=0.24, real=2.45 secs]
2018-07-12T12:44:10.443+0000: [GC (Allocation Failure) [PSYoungGen: 10871808K->293021K(12885504K)] 12642053K->2063274K(42655232K), 0.0385196 secs] [Times: user=0.28 sys=0.00, real=0.04 secs]
2018-07-12T12:50:19.660+0000: [GC (Allocation Failure) [PSYoungGen: 11051947K->435345K(12818944K)] 12822200K->2205607K(42588672K), 0.0395039 secs] [Times: user=0.29 sys=0.00, real=0.03 secs]
2018-07-12T12:56:54.061+0000: [GC (Allocation Failure) [PSYoungGen: 11254929K->750888K(13066240K)] 13025191K->2521157K(42835968K), 0.1679928 secs] [Times: user=1.32 sys=0.00, real=0.17 secs]
2018-07-12T13:04:10.441+0000: [GC (Allocation Failure) [PSYoungGen: 11913512K->444816K(12981248K)] 13683781K->2215093K(42750976K), 0.0347225 secs] [Times: user=0.23 sys=0.00, real=0.03 secs]
2018-07-12T13:08:12.533+0000: [GC (System.gc()) [PSYoungGen: 6145115K->300172K(13202944K)] 7915392K->2070457K(42972672K), 0.0306805 secs] [Times: user=0.19 sys=0.00, real=0.03 secs]
2018-07-12T13:08:12.563+0000: [Full GC (System.gc()) [PSYoungGen: 300172K->0K(13202944K)] [ParOldGen: 1770285K->1916767K(29769728K)] 2070457K->1916767K(42972672K), [Metaspace: 147560K->147560K(149504K)], 2.3451885 secs] [Times: user=17.45 sys=0.13, real=2.35 secs]
2018-07-12T13:16:51.474+0000: [GC (Allocation Failure) [PSYoungGen: 11466240K->307578K(13148160K)] 13383007K->2224353K(42917888K), 0.0254219 secs] [Times: user=0.15 sys=0.00, real=0.02 secs]
2018-07-12T13:23:40.846+0000: [GC (Allocation Failure) [PSYoungGen: 11763994K->462330K(13349888K)] 13680769K->2379113K(43119616K), 0.0555873 secs] [Times: user=0.37 sys=0.01, real=0.06 secs]
2018-07-12T13:30:36.121+0000: [GC (Allocation Failure) [PSYoungGen: 12009585K->626858K(13262336K)] 13926368K->2543649K(43032064K), 0.0945164 secs] [Times: user=0.71 sys=0.00, real=0.09 secs]
2018-07-12T13:37:34.970+0000: [GC (Allocation Failure) [PSYoungGen: 12354218K->327418K(13471232K)] 14271009K->2244217K(43240960K), 0.0279697 secs] [Times: user=0.22 sys=0.00, real=0.02 secs]
2018-07-12T13:38:12.533+0000: [GC (System.gc()) [PSYoungGen: 2813897K->155622K(13421568K)] 4730696K->2072429K(43191296K), 0.0206759 secs] [Times: user=0.15 sys=0.00, real=0.02 secs]
2018-07-12T13:38:12.553+0000: [Full GC (System.gc()) [PSYoungGen: 155622K->0K(13421568K)] [ParOldGen: 1916807K->1776305K(29769728K)] 2072429K->1776305K(43191296K), [Metaspace: 147831K->147831K(151552K)], 1.9034749 secs] [Times: user=14.15 sys=0.03, real=1.90 secs]
2018-07-12T13:47:35.136+0000: [GC (Allocation Failure) [PSYoungGen: 12007936K->405128K(13565952K)] 13784241K->2181441K(43335680K), 0.0497494 secs] [Times: user=0.33 sys=0.00, real=0.05 secs]
2018-07-12T13:54:39.228+0000: [GC (Allocation Failure) [PSYoungGen: 12454079K->591346K(13493248K)] 14230392K->2367667K(43262976K), 0.0931541 secs] [Times: user=0.61 sys=0.00, real=0.10 secs]
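One thing the console output can still tell us is how much wall-clock time went to garbage collection. The sketch below sums the `real=` pause times from HotSpot GC lines in this format; the function name `total_gc_pause` is made up, and the two sample lines are copied from the log above. By a rough count, the pauses in the full log add up to about ten seconds over roughly 100 minutes, so GC alone does not seem to explain the runtime.

```python
import re

# Matches the wall-clock pause at the end of a GC line, e.g. "real=0.85 secs".
REAL_PAUSE = re.compile(r"real=(\d+\.\d+) secs")


def total_gc_pause(log_text):
    """Sum the 'real' pause times (in seconds) over all GC lines in log_text."""
    return sum(float(value) for value in REAL_PAUSE.findall(log_text))


# Two lines copied from the console output above (one young GC, one full GC).
sample = "\n".join([
    "2018-07-12T12:13:41.704+0000: [GC (Allocation Failure) [PSYoungGen: 11111158K->1735583K(13024256K)] 11329091K->1953532K(42793984K), 0.8531040 secs] [Times: user=5.01 sys=1.55, real=0.85 secs]",
    "2018-07-12T12:38:12.723+0000: [Full GC (System.gc()) [PSYoungGen: 1177355K->0K(12049408K)] [ParOldGen: 1004475K->1770245K(29769728K)] 2181831K->1770245K(41819136K), [Metaspace: 147201K->147049K(149504K)], 2.4546218 secs] [Times: user=15.66 sys=0.24, real=2.45 secs]",
])

total = total_gc_pause(sample)
print("total GC pause: %.2f s" % total)  # total GC pause: 3.30 s
```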

No, that’s the console output. What happens if you go to the IP address we print there? I’m not sure how to access the web UI on Databricks.

The link doesn’t work. Which information should I pull? I can see the Spark UI, and there are several tabs such as Stages, Storage, Environment, and Executors.

A screenshot of “Stages” and another of the running stage while doing PCA would be very useful, thanks!

[Screenshot: active stage]