Help with first commands / combine on local storage

Hi folks,

I’m a beginner working through a Singularity container that is running Hail 0.2.93-d77cdf0157c9.

I can run the basic test without error…

import hail as hl

mt = hl.balding_nichols_model(n_populations=3,
                              n_samples=10,
                              n_variants=100)
mt.show()

For my next trick I wanted to try loading a VCF into the combiner, using the commands below. I am testing on local storage on a machine with 2 TB of RAM and 128 CPUs.

vcf = '/path/to/BQC12345.md.snaut.recal.GRCh38_full_analysis_set_plus_decoy_hla.g.vcf.gz.filtered.g.vcf.gz'
combiner = hl.vds.new_combiner(
    output_path='file:///projects/rcorbettprj2/hail/dataset.vds',
    temp_path='file:///projects/rcorbettprj2/hail',
    gvcf_paths=vcf,
    use_genome_default_intervals=True,
)

which gives the following error trace:

2022-11-02 07:30:27 Hail: WARN: expected input file 'file:/' to end in .vcf[.bgz, .gz]
---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-4-aeea38f2bb9e> in <module>
----> 1 combiner = hl.vds.new_combiner(
      2     output_path='file:///projects/rcorbettprj2/hail/dataset.vds',
      3     temp_path='file:///projects/rcorbettprj2/hail',
      4     gvcf_paths=vcf,
      5     use_genome_default_intervals=True,
...
FatalError: FileNotFoundException: 'file:/' is a directory (or native Table/MatrixTable)

Java stack trace:
java.io.FileNotFoundException: 'file:/' is a directory (or native Table/MatrixTable)
	at is.hail.io.fs.HadoopFS.openNoCompression(HadoopFS.scala:89)
	at is.hail.io.fs.FS.open(FS.scala:140)
	at is.hail.io.fs.FS.open$(FS.scala:139)
	at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:72)
	at is.hail.io.fs.FS.open(FS.scala:152)
...
Hail version: 0.2.93-d77cdf0157c9
Error summary: FileNotFoundException: 'file:/' is a directory (or native Table/MatrixTable)

I have confirmed that the VCF is readable in this session; I just can’t seem to get Hail to use it.

Can you point out my error?

thanks,
Richard

Try passing [vcf] instead of vcf for gvcf_paths – that argument expects a sequence of strings. Unfortunately, a single string is itself a sequence of strings, so I think your path is being interpreted as ['/', 'p', 'a', 't', 'h', ...].
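
For reference, the corrected combiner call from the first post would look like this – only gvcf_paths changes, wrapped in a single-element list:

combiner = hl.vds.new_combiner(
    output_path='file:///projects/rcorbettprj2/hail/dataset.vds',
    temp_path='file:///projects/rcorbettprj2/hail',
    gvcf_paths=[vcf],  # a list of path strings, not a bare string
    use_genome_default_intervals=True,
)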

Great. Seems we’re on the same page!
Now I’m getting some Java GC overhead errors. It is still running, but looks suspect with these errors:

combiner.run()
2022-11-02 08:09:26 Hail: INFO: Running VDS combiner:
    VDS arguments: 0 datasets with 0 samples
    GVCF arguments: 1 inputs/samples
    Branch factor: 100
    GVCF merge batch size: 58
2022-11-02 08:09:27 Hail: INFO: GVCF combine (job 1): merging 1 GVCFs into 1 datasets
Exception in thread "Spark Context Cleaner" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "refresh progress" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "dispatcher-event-loop-124" java.lang.BootstrapMethodError: call site initialization exception
	at java.lang.invoke.CallSite.makeSite(CallSite.java:341)
	at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(MethodHandleNatives.java:307)
	at java.lang.invoke.MethodHandleNatives.linkCallSite(MethodHandleNatives.java:297)
	at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:149)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "stop-spark-context" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Executor task launch worker for task 59.0 in stage 13.0 (TID 137)" java.lang.OutOfMemoryError: GC overhead limit exceeded

See here: How do I increase the memory or RAM available to the JVM when I start Hail through Python?

The default memory reservation isn’t enough even for small Spark data structures.
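
For anyone who skips the link: the approach described there is to set PYSPARK_SUBMIT_ARGS in the environment before Hail is imported, so the JVM starts with a larger heap. A minimal sketch, using the 48g value Richard reports below:

import os

# Must be set before `import hail`; it is read when the JVM starts.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 48g pyspark-shell'

import hail as hl
hl.init()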

Thanks!
I had to move it up to 48 GB to load 1000 gVCFs.

How many cores does your machine have? 48 GB sounds like a lot unless you’ve got something like 16+ cores (and 16+ parallel threads processing data).

I have 144 processors (Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz) with 1.5 TB of RAM. It’s a shared compute node. I first tried 8 GB as in the example you linked, but got Java heap errors when trying to load a big bolus of VCFs. I didn’t test any values between 8 GB and 48 GB, so it is quite possible I only needed 10.

Do you want to use all 144? You can initialize Hail to use fewer:

import hail as hl
hl.init(master='local[24]')
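
Putting the thread’s two fixes together, a sketch of how a session like this might start (24 threads and 48g are just the values discussed above, not general recommendations):

import os

# Reserve JVM memory before importing hail (see the FAQ link above).
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 48g pyspark-shell'

import hail as hl

# Run Spark in local mode with 24 worker threads instead of all cores.
hl.init(master='local[24]')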