Help with first commands / combine on local storage

Hi folks,

I’m working with a beginner who is working through a Singularity container that is running Hail 0.2.93-d77cdf0157c9.

I can run the basic test without error…

import hail as hl
mt = hl.balding_nichols_model(n_populations=3,
                              n_samples=10,
                              n_variants=100)

For my next trick I wanted to try loading a VCF into the combiner. For this I used the following commands. I am testing with local storage on a machine with 2 TB of RAM and 128 CPUs.

combiner = hl.vds.new_combiner(
    output_path='file:///projects/rcorbettprj2/hail/dataset.vds',
    temp_path='file:///projects/rcorbettprj2/hail',
    gvcf_paths=vcf,
    use_genome_default_intervals=True,
)

which gives the following error trace:

2022-11-02 07:30:27 Hail: WARN: expected input file 'file:/' to end in .vcf[.bgz, .gz]
FatalError                                Traceback (most recent call last)
<ipython-input-4-aeea38f2bb9e> in <module>
----> 1 combiner = hl.vds.new_combiner(
      2     output_path='file:///projects/rcorbettprj2/hail/dataset.vds',
      3     temp_path='file:///projects/rcorbettprj2/hail',
      4     gvcf_paths=vcf,
      5     use_genome_default_intervals=True,
FatalError: FileNotFoundException: 'file:/' is a directory (or native Table/MatrixTable)

Java stack trace: 'file:/' is a directory (or native Table/MatrixTable)
Hail version: 0.2.93-d77cdf0157c9
Error summary: FileNotFoundException: 'file:/' is a directory (or native Table/MatrixTable)

I have confirmed that the VCF is readable in this session; I just can’t seem to get Hail to use it.

Can you point out my error?


I think try passing [vcf] instead of vcf in gvcf_paths: that argument expects a sequence of strings. Unfortunately, a single string is itself a sequence of strings, so I think your path is being interpreted as ['/', 'p', 'a', 't', 'h', ...], which is why the error complains about 'file:/'.
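The pitfall above is plain Python, so it’s easy to see outside of Hail. A minimal sketch (the path below is a stand-in, not the poster’s actual file):

```python
# A string is itself a sequence, so iterating over it yields one
# character at a time -- Hail then treats each character as a "path".
vcf = "file:///projects/sample.g.vcf.bgz"

as_string = list(vcf)    # ['f', 'i', 'l', 'e', ':', '/', ...] -- wrong
as_list = list([vcf])    # ['file:///projects/sample.g.vcf.bgz'] -- right

print(as_string[:6])
print(as_list)
```

Wrapping the single path in a one-element list gives the combiner exactly one GVCF to read.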

Great. Seems we’re on the same page!
Now I’m getting some Java GC overhead errors. The job is still running, but it looks suspect with these errors:
2022-11-02 08:09:26 Hail: INFO: Running VDS combiner:
    VDS arguments: 0 datasets with 0 samples
    GVCF arguments: 1 inputs/samples
    Branch factor: 100
    GVCF merge batch size: 58
2022-11-02 08:09:27 Hail: INFO: GVCF combine (job 1): merging 1 GVCFs into 1 datasets
Exception in thread "Spark Context Cleaner" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "refresh progress" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "dispatcher-event-loop-124" java.lang.BootstrapMethodError: call site initialization exception
	at java.lang.invoke.CallSite.makeSite(
	at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(
	at java.lang.invoke.MethodHandleNatives.linkCallSite(
	at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:149)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$
	at java.util.concurrent.ThreadPoolExecutor.runWorker(
	at java.util.concurrent.ThreadPoolExecutor$
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "stop-spark-context" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Executor task launch worker for task 59.0 in stage 13.0 (TID 137)" java.lang.OutOfMemoryError: GC overhead limit exceeded

See here: How do I increase the memory or RAM available to the JVM when I start Hail through Python?

The default memory reservation is only enough for small Spark data structures.
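As the linked FAQ describes, the JVM memory has to be configured before Hail is imported, because Hail launches the JVM when it starts. A minimal sketch; the 48g figure comes from later in this thread and should be tuned for your own node:

```python
import os

# Must run *before* `import hail`: Hail reads PYSPARK_SUBMIT_ARGS
# when it starts the JVM, so setting it afterwards has no effect.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 48g pyspark-shell"

# import hail as hl   # only import Hail after the variable is set
# hl.init()
```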

I had to move it up to 48 GB to load 1000 gVCFs.

How many cores does your machine have? 48 GB sounds like a lot unless you’ve got something like 16+ cores (and 16+ parallel threads processing data).

I have 144 processors (Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz) with 1.5 TB of RAM. It’s a shared compute node. I first tried 8 GB as in the example you linked, but got Java heap errors when trying to load a big bolus of VCFs. I didn’t test any values between 8 GB and 48 GB, so it is quite possible I only needed 10.

Do you want to use all 144? You can initialize Hail to use less:

import hail as hl
hl.init(master='local[16]')  # e.g. cap Spark at 16 local threads instead of all cores