Expanding cores available to Hail by using extra VMs

hhx037 · November 18, 2018, 11:08am

Sorry if this is not a Hail question per se, but I suppose you’d know the answer.
I am limited to 46 cores per VM, and at the moment I’m successfully running Hail 0.2 on such VM.
I have to analyse UK Biobank so I think it would be a good idea to have more cores available for the computations. I can create other VMs (also with 46 cores), so my questions are:

Is it possible to expand Hail computations onto cores from other VMs using spark submit?
If so, on these external VMs, would I have to install Hail etc on top of Spark?
Apart from the RAM per core, will I need to associate large local disk space for caching or something like that?

hhx037 · February 11, 2019, 2:56pm

Hi,

so, I now have access to two 48-core VMs and I installed Spark on both, following the following instructions https://medium.com/ymedialabs-innovation/apache-spark-on-a-multi-node-cluster-b75967c8cb2b

Spark seems to be up and running on both the master and slave.

Then I proceeded to install conda and Hail on the Master. It seems Hail jobs only run on 48 cores, so it doesn’t use the Slave VM at all.

Any pointers?

tpoterba · February 11, 2019, 3:39pm

Did you install Hail using pip? If so, it may be using a local installation by default.

You can try setting the master in hl.init(master='...') to the uri for the spark master process.

hhx037 · February 11, 2019, 3:46pm

No, I got hail from Github and ran

./gradlew -Dspark.version=2.2.0 shadowJar archiveZip

Adding master to hl.init throws an error:

hl.init(master=‘10.4.1.208’)
Using Spark’s default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to “WARN”.
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File “”, line 1, in
File “”, line 2, in init
File “/usr/local/hail/build/distributions/hail-python.zip/hail/typecheck/check.py”, line 561, in wrapper
File “/usr/local/hail/build/distributions/hail-python.zip/hail/context.py”, line 256, in init
File “”, line 2, in init
File “/usr/local/hail/build/distributions/hail-python.zip/hail/typecheck/check.py”, line 561, in wrapper
File “/usr/local/hail/build/distributions/hail-python.zip/hail/context.py”, line 97, in init
File “/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py”, line 1133, in call
File “/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py”, line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:is.hail.HailContext.apply.
: org.apache.spark.SparkException: Could not parse Master URL: ‘10.4.1.208’
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2760)
at org.apache.spark.SparkContext.(SparkContext.scala:501)
at is.hail.HailContext$.configureAndCreateSparkContext(HailContext.scala:112)
at is.hail.HailContext$.apply(HailContext.scala:237)
at is.hail.HailContext.apply(HailContext.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

tpoterba · February 11, 2019, 4:03pm

When you use plain Spark, can you use all 96 cores?

It’s also possible that your memory settings are such that Spark can’t allocate all the cores to preserve the memory:core ratio specified in the configuration.

hhx037 · February 11, 2019, 4:42pm

In my bashrc I have these options for export PYSPARK_SUBMIT_ARGS:

–executor-memory 8g
–driver-memory 8g pyspark-shell

and do have 8GB or RAM per core.

How can I test the plain Spark use of cores? (sorry, I only use Spark with Hail, and am not very familiar with it)

tpoterba · February 11, 2019, 4:56pm

try:

import pyspark
pyspark.SparkContext().range(1_000_000_000, numSlices=1000).count()

hhx037 · February 11, 2019, 5:08pm

Great, thanks.

It used only 48 cores

tpoterba · February 11, 2019, 5:09pm

try bringing the executor memory down to 7G. There’s some complex calculus involving overheads and a bunch of other memory settings.

hhx037 · February 11, 2019, 5:19pm

Still only 48.

tpoterba · February 11, 2019, 5:27pm

Is it using 48 cores on one machine or 24 on each?

hhx037 · February 11, 2019, 5:49pm

How can I check that?

Nine](http://www.9folders.com/)

tpoterba · February 11, 2019, 6:03pm

Can you open the WebUI? It’s port 4040 on the driver machine.

hhx037 · February 12, 2019, 10:23am

OK, managed to get my X11 forwarding working, sorry for the delay.

I’m not sure what to llook for, so here are the screenshots of the :8080 and :4040

tpoterba · February 12, 2019, 10:42am

huh, this looks like a properly networked 2-node cluster.

It’s possible that the driver nodes aren’t contributing to the execution core pool. The executor size is also really big here, we’ve found things work well with ~4-8 core executors. Did you set up the cluster yourself, or is it a standing cluster set up by an IT team?

hhx037 · February 12, 2019, 11:02am

I created the VMs myself, but I am limited as to the “flavours” of the VMs (number of cores, RAM etc), which are made available by the central IT team.

tpoterba · February 12, 2019, 11:03am

Can you try redeploying the cluster with some modified config:

spark.executor.cores=8
spark.driver.cores=8

hhx037 · February 12, 2019, 11:05am

Is this in the spark-env.sh configuration file?

tpoterba · February 12, 2019, 11:06am

should go in spark-defaults.conf in the same folder, I think

hhx037 · February 12, 2019, 11:21am

OK, did that and restarted Spark.

I could see the number after “+” briefly go over 48 from time to time (not sure if it means anything, just to let you know)

[Stage 0:==================> (3618 + 50) / 10000]

Topic		Replies	Views
Running hail locally - number of cores Hail Query & hailctl	3	808	March 28, 2023
Running HAIL on more than one compute node Hail Query & hailctl	0	161	February 15, 2024
Running Hail with a remote Spark Help [0.1]	10	2768	June 2, 2017
Hail on Spark cluster! Help [0.1]	2	840	March 20, 2018
Questions about optimizing Hail and Spark configs and estimating resources and runtimes Hail Query & hailctl	3	1146	December 1, 2022

Expanding cores available to Hail by using extra VMs

Related topics