Hi folks,
I’m a beginner working through a Singularity container that is running Hail 0.2.93-d77cdf0157c9.
I can run the basic test without error…
import hail as hl
mt = hl.balding_nichols_model(n_populations=3,
                              n_samples=10,
                              n_variants=100)
mt.show()
For my next trick I wanted to try loading a VCF into the combiner. For this I used the following commands. I am testing using local storage on a machine with 2 TB of RAM and 128 CPUs.
vcf = '/path/to/BQC12345.md.snaut.recal.GRCh38_full_analysis_set_plus_decoy_hla.g.vcf.gz.filtered.g.vcf.gz'
combiner = hl.vds.new_combiner(
    output_path='file:///projects/rcorbettprj2/hail/dataset.vds',
    temp_path='file:///projects/rcorbettprj2/hail',
    gvcf_paths=vcf,
    use_genome_default_intervals=True,
)
which gives the following error trace:
2022-11-02 07:30:27 Hail: WARN: expected input file 'file:/' to end in .vcf[.bgz, .gz]
---------------------------------------------------------------------------
FatalError Traceback (most recent call last)
<ipython-input-4-aeea38f2bb9e> in <module>
----> 1 combiner = hl.vds.new_combiner(
2 output_path='file:///projects/rcorbettprj2/hail/dataset.vds',
3 temp_path='file:///projects/rcorbettprj2/hail',
4 gvcf_paths=vcf,
5 use_genome_default_intervals=True,
...
FatalError: FileNotFoundException: 'file:/' is a directory (or native Table/MatrixTable)
Java stack trace:
java.io.FileNotFoundException: 'file:/' is a directory (or native Table/MatrixTable)
at is.hail.io.fs.HadoopFS.openNoCompression(HadoopFS.scala:89)
at is.hail.io.fs.FS.open(FS.scala:140)
at is.hail.io.fs.FS.open$(FS.scala:139)
at is.hail.io.fs.HadoopFS.open(HadoopFS.scala:72)
at is.hail.io.fs.FS.open(FS.scala:152)
...
Hail version: 0.2.93-d77cdf0157c9
Error summary: FileNotFoundException: 'file:/' is a directory (or native Table/MatrixTable)
I have confirmed that the VCF is readable in this session; I just can’t seem to get Hail to use it.
Can you point out my error?
thanks,
Richard
I think the fix is to pass [vcf] instead of vcf for gvcf_paths – that argument expects a sequence of strings. Unfortunately, a single string is itself a sequence of strings, so I think the path is being interpreted as ['/', 'p', 'a', 't', 'h', ...].
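For example, here is the same call from your post with the path wrapped in a list (just a sketch, reusing your paths as-is):

vcf = '/path/to/BQC12345.md.snaut.recal.GRCh38_full_analysis_set_plus_decoy_hla.g.vcf.gz.filtered.g.vcf.gz'
combiner = hl.vds.new_combiner(
    output_path='file:///projects/rcorbettprj2/hail/dataset.vds',
    temp_path='file:///projects/rcorbettprj2/hail',
    gvcf_paths=[vcf],  # a list of path strings, not a bare string
    use_genome_default_intervals=True,
)
combiner.run()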
Great. Seems we’re on the same page!
Now I’m getting some Java GC overhead errors. It is still running, but looks suspect with these errors:
combiner.run()
2022-11-02 08:09:26 Hail: INFO: Running VDS combiner:
VDS arguments: 0 datasets with 0 samples
GVCF arguments: 1 inputs/samples
Branch factor: 100
GVCF merge batch size: 58
2022-11-02 08:09:27 Hail: INFO: GVCF combine (job 1): merging 1 GVCFs into 1 datasets
Exception in thread "Spark Context Cleaner" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "refresh progress" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "dispatcher-event-loop-124" java.lang.BootstrapMethodError: call site initialization exception
at java.lang.invoke.CallSite.makeSite(CallSite.java:341)
at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(MethodHandleNatives.java:307)
at java.lang.invoke.MethodHandleNatives.linkCallSite(MethodHandleNatives.java:297)
at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:149)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "stop-spark-context" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Executor task launch worker for task 59.0 in stage 13.0 (TID 137)" java.lang.OutOfMemoryError: GC overhead limit exceeded
Thanks!
I had to move the memory up to 48 GB to load 1000 gVCFs.
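(In case it helps anyone else, a minimal sketch of how driver memory is typically raised for Hail's local Spark backend – I'm assuming that's the setting being adjusted here – it has to be set before Hail launches its JVM; the 48g just mirrors the value above.)

import os
# Raise the Spark driver memory before Hail starts its JVM;
# 48g mirrors the figure mentioned above.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 48g pyspark-shell'

import hail as hl
hl.init()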
How many cores does your machine have? 48 GB sounds like a lot unless you’ve got something like 16+ cores (and 16+ parallel threads processing data).
I have 144 processors (Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz) with 1.5 TB of RAM. It’s a shared compute node. I first tried 8 GB as in the example you linked, but got Java heap errors when trying to load a big bolus of VCFs. I didn’t test any values between 8 GB and 48 GB, so it is quite possible I only needed 10 GB.
Do you want to use all 144? You can initialize Hail to use fewer:
import hail as hl
hl.init(master='local[24]')