Expanding cores available to Hail by using extra VMs


#1

Sorry if this is not a Hail question per se, but I suppose you’d know the answer.
I am limited to 46 cores per VM, and at the moment I’m successfully running Hail 0.2 on one such VM.
I have to analyse UK Biobank so I think it would be a good idea to have more cores available for the computations. I can create other VMs (also with 46 cores), so my questions are:

  1. Is it possible to expand Hail computations onto cores from other VMs using spark-submit?
  2. If so, on these external VMs, would I have to install Hail etc on top of Spark?
  3. Apart from the RAM per core, will I need to associate large local disk space for caching or something like that?

#2

Hi,

so, I now have access to two 48-core VMs and I installed Spark on both, following these instructions: https://medium.com/ymedialabs-innovation/apache-spark-on-a-multi-node-cluster-b75967c8cb2b

Spark seems to be up and running on both the master and slave.

Then I proceeded to install conda and Hail on the Master. It seems Hail jobs only run on 48 cores, so it doesn’t use the Slave VM at all.

Any pointers?


#3

Did you install Hail using pip? If so, it may be using a local installation by default.

You can try setting the master in hl.init(master='...') to the uri for the spark master process.


#4

No, I got hail from Github and ran

./gradlew -Dspark.version=2.2.0 shadowJar archiveZip

Adding master to hl.init throws an error:

hl.init(master='10.4.1.208')
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "", line 1, in
File "", line 2, in init
File "/usr/local/hail/build/distributions/hail-python.zip/hail/typecheck/check.py", line 561, in wrapper
File "/usr/local/hail/build/distributions/hail-python.zip/hail/context.py", line 256, in init
File "", line 2, in init
File "/usr/local/hail/build/distributions/hail-python.zip/hail/typecheck/check.py", line 561, in wrapper
File "/usr/local/hail/build/distributions/hail-python.zip/hail/context.py", line 97, in init
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:is.hail.HailContext.apply.
: org.apache.spark.SparkException: Could not parse Master URL: '10.4.1.208'
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2760)
at org.apache.spark.SparkContext.(SparkContext.scala:501)
at is.hail.HailContext$.configureAndCreateSparkContext(HailContext.scala:112)
at is.hail.HailContext$.apply(HailContext.scala:237)
at is.hail.HailContext.apply(HailContext.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
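
The "Could not parse Master URL" line in the traceback is what Spark reports when the master string is not in a form it recognises, and a bare IP address has no scheme. A standalone master is normally addressed as spark://host:port. Here is a minimal sketch, assuming the master listens on the default standalone port 7077 (the actual URL is shown at the top of the master's :8080 page). `standalone_master_url` and `looks_like_master_url` are hypothetical helpers, and the pattern list is a rough approximation, not Spark's own parser:

```python
import re

# Assumption: 7077 is the default port of a standalone Spark master.
# Check the master's :8080 page for the URL your cluster really uses.
DEFAULT_STANDALONE_PORT = 7077


def standalone_master_url(host, port=DEFAULT_STANDALONE_PORT):
    """Build a spark://host:port URL (hypothetical helper)."""
    return f"spark://{host}:{port}"


def looks_like_master_url(master):
    """Rough sanity check for a few common master strings.

    This only approximates what Spark accepts; it is not Spark's parser.
    """
    patterns = [
        r"local",
        r"local\[(\d+|\*)\]",
        r"spark://.+",
        r"mesos://.+",
        r"yarn",
    ]
    return any(re.fullmatch(p, master) for p in patterns)


url = standalone_master_url("10.4.1.208")
print(url)                                   # spark://10.4.1.208:7077
print(looks_like_master_url("10.4.1.208"))   # False: no scheme
print(looks_like_master_url(url))            # True

# With Hail installed, the URL would then be passed as in post #3:
# import hail as hl
# hl.init(master=url)
```

If the port differs on your cluster, pass it explicitly, e.g. `standalone_master_url("10.4.1.208", port=...)`.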


#5

When you use plain Spark, can you use all 96 cores?

It’s also possible that your memory settings are such that Spark can’t allocate all the cores to preserve the memory:core ratio specified in the configuration.


#6

In my bashrc I have these options for export PYSPARK_SUBMIT_ARGS:

--executor-memory 8g
--driver-memory 8g pyspark-shell

and I do have 8GB of RAM per core.

How can I test the plain Spark use of cores? (sorry, I only use Spark with Hail, and am not very familiar with it)


#7

try:

import pyspark
pyspark.SparkContext().range(1_000_000_000, numSlices=1000).count()

#8

Great, thanks.

It used only 48 cores :confused:


#9

try bringing the executor memory down to 7G. There’s some complex calculus involving overheads and a bunch of other memory settings.


#10

Still only 48.


#11

Is it using 48 cores on one machine or 24 on each?


#12

How can I check that?



#13

Can you open the WebUI? It’s port 4040 on the driver machine.
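
If a browser is awkward to reach, the same port also serves Spark's monitoring REST API, and the executor list there can answer the "48 on one machine or 24 on each" question. Here is a rough sketch, assuming the executor entries carry the fields `id`, `hostPort` and `totalCores`. The helper `cores_per_host` and the sample data (host names `vm-a` and `vm-b`, port numbers) are hypothetical:

```python
from collections import defaultdict


def cores_per_host(executors):
    """Sum executor cores per host, skipping the driver entry."""
    totals = defaultdict(int)
    for ex in executors:
        if ex["id"] == "driver":
            continue  # the driver is listed too but runs no tasks here
        host = ex["hostPort"].rsplit(":", 1)[0]
        totals[host] += ex["totalCores"]
    return dict(totals)


# Hypothetical sample in the shape of the executor list; on a live
# application it could be fetched (while the job is running) with:
#   import json, urllib.request
#   base = "http://localhost:4040/api/v1/applications"
#   app_id = json.load(urllib.request.urlopen(base))[0]["id"]
#   executors = json.load(urllib.request.urlopen(f"{base}/{app_id}/executors"))
sample = [
    {"id": "driver", "hostPort": "vm-a:40001", "totalCores": 0},
    {"id": "0", "hostPort": "vm-a:40002", "totalCores": 48},
    {"id": "1", "hostPort": "vm-b:40003", "totalCores": 48},
]

print(cores_per_host(sample))
```

If only one host shows up in the result, the second VM is not contributing executors to the application.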


#14

OK, managed to get my X11 forwarding working, sorry for the delay.

I’m not sure what to look for, so here are the screenshots of the :8080 and :4040 pages


#15

huh, this looks like a properly networked 2-node cluster.

It’s possible that the driver nodes aren’t contributing to the execution core pool. The executor size is also really big here; we’ve found things work well with ~4-8 core executors. Did you set up the cluster yourself, or is it a standing cluster set up by an IT team?


#16

I created the VMs myself, but I am limited as to the “flavours” of the VMs (number of cores, RAM etc), which are made available by the central IT team.


#17

Can you try redeploying the cluster with some modified config:

spark.executor.cores=8
spark.driver.cores=8
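
As a back-of-the-envelope check on what these settings might give, the number of executors a worker can host is limited by both its cores and its memory. This is a simplified sketch using the numbers mentioned in this thread (48 cores per VM, 8 GB of RAM per core, 8g executor memory). It ignores memory overheads and whatever the worker reserves for itself, so treat the result as an upper bound. `executors_per_worker` is a hypothetical helper:

```python
def executors_per_worker(worker_cores, worker_mem_gb,
                         executor_cores, executor_mem_gb):
    """Upper bound on executors that fit on one worker.

    Simplification: ignores memory overheads and memory the worker
    does not offer to Spark.
    """
    by_cores = worker_cores // executor_cores
    by_memory = worker_mem_gb // executor_mem_gb
    return min(by_cores, by_memory)


# Numbers from this thread: 48 cores per VM and 8 GB of RAM per core.
worker_cores = 48
worker_mem_gb = 48 * 8

n = executors_per_worker(worker_cores, worker_mem_gb,
                         executor_cores=8, executor_mem_gb=8)
print(n, "executors per worker,", n * 8, "cores per worker")
```

Under these assumptions cores are the limiting factor, not memory. If the workers offer much less memory than the VM physically has, the memory bound takes over instead.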


#18

Is this in the spark-env.sh configuration file?


#19

should go in spark-defaults.conf in the same folder, I think


#20

OK, did that and restarted Spark.

I could see the number after “+” briefly go over 48 from time to time (not sure if it means anything, just to let you know)

[Stage 0:==================> (3618 + 50) / 10000]