Load VCF from SMB share

I’m trying to load VCF files from an Isilon server (since I don’t have terabytes of storage on my laptop).

The problem seems to be that Hadoop doesn’t like the : character in the path to my SMB share. I can open the file at this path as a regular file object in Python, so I know the path is good.

Any suggestions?

code:
filename = '/run/user/110911/gvfs/smb-share:server=usfc,share=03247-G_1.recalibrated.haplotypeCalls.annotated.vcf'

mt = hl.import_vcf(filename)
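
For reference, this is the check I mean by opening the file as a file object in Python; it works fine, so the mount itself is OK:

with open(filename) as f:
    print(f.readline())  # prints the first header line of the VCF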

traceback:

FatalError: URISyntaxException: Relative path in absolute URI: smb-share:server=usfc

Java stack trace:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: smb-share:server=usfc,
at org.apache.hadoop.fs.Path.initialize(Path.java:259)
at org.apache.hadoop.fs.Path.<init>(Path.java:217)
at org.apache.hadoop.fs.Path.<init>(Path.java:125)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2016)
at is.hail.io.fs.HadoopFS.glob(HadoopFS.scala:155)
at is.hail.io.fs.HadoopFS.$anonfun$globAll$1(HadoopFS.scala:134)
at is.hail.io.fs.HadoopFS.$anonfun$globAll$1$adapted(HadoopFS.scala:133)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at is.hail.io.fs.HadoopFS.globAll(HadoopFS.scala:139)
at is.hail.io.vcf.MatrixVCFReader$.apply(LoadVCF.scala:1572)
at is.hail.io.vcf.MatrixVCFReader$.fromJValue(LoadVCF.scala:1667)
at is.hail.expr.ir.MatrixReader$.fromJson(MatrixIR.scala:92)
at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1714)
at is.hail.expr.ir.IRParser$.$anonfun$matrix_ir$1(Parser.scala:1640)
at is.hail.utils.StackSafe$More.advance(StackSafe.scala:64)
at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
at is.hail.expr.ir.IRParser$.$anonfun$parse_matrix_ir$1(Parser.scala:1980)
at is.hail.expr.ir.IRParser$.parse(Parser.scala:1967)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1980)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$2(SparkBackend.scala:653)
at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:627)
at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:627)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:46)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$1(SparkBackend.scala:652)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
at is.hail.backend.spark.SparkBackend.parse_matrix_ir(SparkBackend.scala:651)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

Hail version: 0.2.74-0c3a74d12093

I’ve also tried following the solution suggested here, using:

spark-shell --conf spark.sql.warehouse.dir=file:///c:/tmp/spark-warehouse

but it doesn’t change the error message.

OK, so the issue here is that Spark expects URIs, not file paths. In a URI, everything before the colon is the scheme. Spark thinks you’re using some scheme/protocol called /run/user/110911/gvfs/smb-share, and that within that scheme you want a file called server=usfc,share=03247-G_1.recalibrated.haplotypeCalls.annotated.vcf. Because this file name does not begin with a /, it is a relative path, hence the “Relative path in absolute URI” error.
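
You can see the same split using Python’s urlparse (Hadoop uses java.net.URI internally, but both follow the same RFC 3986 grammar). Running it on the failing component from your traceback:

from urllib.parse import urlparse

parsed = urlparse('smb-share:server=usfc,share=03247-G_1.recalibrated.haplotypeCalls.annotated.vcf')
print(parsed.scheme)  # 'smb-share': everything before the first colon
print(parsed.path)    # 'server=usfc,share=...': no leading /, hence a relative path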

In general, in Hadoop, if you want to use a local file path, you have to explicitly include the file: scheme. In your case, I think the correct URI is:

file:/run/user/110911/gvfs/smb-share:server=usfc,share=03247-G_1.recalibrated.haplotypeCalls.annotated.vcf
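
In other words (untested on my end, but this is the shape of the call):

mt = hl.import_vcf('file:/run/user/110911/gvfs/smb-share:server=usfc,share=03247-G_1.recalibrated.haplotypeCalls.annotated.vcf')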

Thanks. Adding file:/// didn’t change the error message, so I tried a couple of solutions to have the terminal do the URI encoding for me.

e.g. the returned string now percent-escapes the problematic characters:
filename = 'file:///run/user/110911/gvfs/smb-share%3Aserver%3Dusfc…vcf'

The new problem is that the filename doesn’t seem to exist. Any other suggestions, or maybe a similar use case you can point me to? I’m guessing someone else must be reading VCFs from a non-local location.

FatalError: HailException: arguments refer to no files

Java stack trace:
is.hail.utils.HailException: arguments refer to no files
at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:11)
at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:11)
at is.hail.utils.package$.fatal(package.scala:77)
at is.hail.io.vcf.LoadVCF$.globAllVCFs(LoadVCF.scala:1143)
at is.hail.io.vcf.MatrixVCFReader$.apply(LoadVCF.scala:1572)
at is.hail.io.vcf.MatrixVCFReader$.fromJValue(LoadVCF.scala:1667)
at is.hail.expr.ir.MatrixReader$.fromJson(MatrixIR.scala:92)
at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1714)
at is.hail.expr.ir.IRParser$.$anonfun$matrix_ir$1(Parser.scala:1640)
at is.hail.utils.StackSafe$More.advance(StackSafe.scala:64)
at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
at is.hail.expr.ir.IRParser$.$anonfun$parse_matrix_ir$1(Parser.scala:1980)
at is.hail.expr.ir.IRParser$.parse(Parser.scala:1967)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1980)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$2(SparkBackend.scala:653)
at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:627)
at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:627)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:46)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$1(SparkBackend.scala:652)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
at is.hail.backend.spark.SparkBackend.parse_matrix_ir(SparkBackend.scala:651)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

Hmm. I highly doubt percent-encoding the filename will work because the SMB server is probably not aware of percent encoding. Percent encoding is primarily used by HTTP servers.

I believe the issue is that Hadoop has a non-conforming implementation of RFC 3986 URIs. Section 4.2 specifically notes that colons may appear in the path and that special syntax is needed for relative paths (which you do not have). In particular, there is a Stack Overflow post which details four different Hadoop bugs, none of which is resolved. The root issue has been open since 2008, so I highly doubt Hadoop will ever fix it. I empathize with your frustration; this is a glaring and annoying limitation in Hadoop.

I think you should create a symlink, whose name includes no colons, to the desired file. Then, try to load the data through that symlink.


Also, just to be abundantly clear, I think it’s actually rather important that you use file:/run... not file:///run... The URI syntax requires that the “host” appear between the first // and the path (in your case, /run/...). The file protocol has no notion of “host”, so it is inappropriate to include an empty host as in ///run....

That worked! Thanks. For reference, the syntax is:

ln -s target_file link_name
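
Concretely, something like this, where /home/me/sample.vcf is just a stand-in for any colon-free location:

ln -s '/run/user/110911/gvfs/smb-share:server=usfc,share=03247-G_1.recalibrated.haplotypeCalls.annotated.vcf' /home/me/sample.vcf

and then, per the advice above:

mt = hl.import_vcf('file:/home/me/sample.vcf')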