Load VCF from SMB share

I’m trying to load VCF files from an Isilon server (since I don’t have terabytes of storage on my laptop).

The problem seems to be that Hadoop doesn’t like the : character in the path to my SMB share. I can open the file at this path as a regular file object in Python, so I know the path is good.

Any suggestions?

code:
filename = '/run/user/110911/gvfs/smb-share:server=usfc,share=03247-G_1.recalibrated.haplotypeCalls.annotated.vcf'

mt = hl.import_vcf(filename)
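
For reference, this is the check I mean by opening the file as a file object in Python; it works fine, so the mount itself is OK:

with open(filename) as f:
    print(f.readline())  # prints the first header line of the VCF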

traceback:

FatalError: URISyntaxException: Relative path in absolute URI: smb-share:server=usfc

Java stack trace:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: smb-share:server=usfc,
at org.apache.hadoop.fs.Path.initialize(Path.java:259)
at org.apache.hadoop.fs.Path.<init>(Path.java:217)
at org.apache.hadoop.fs.Path.<init>(Path.java:125)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2016)
at is.hail.io.fs.HadoopFS.glob(HadoopFS.scala:155)
at is.hail.io.fs.HadoopFS.$anonfun$globAll$1(HadoopFS.scala:134)
at is.hail.io.fs.HadoopFS.$anonfun$globAll$1$adapted(HadoopFS.scala:133)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at is.hail.io.fs.HadoopFS.globAll(HadoopFS.scala:139)
at is.hail.io.vcf.MatrixVCFReader$.apply(LoadVCF.scala:1572)
at is.hail.io.vcf.MatrixVCFReader$.fromJValue(LoadVCF.scala:1667)
at is.hail.expr.ir.MatrixReader$.fromJson(MatrixIR.scala:92)
at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1714)
at is.hail.expr.ir.IRParser$.$anonfun$matrix_ir$1(Parser.scala:1640)
at is.hail.utils.StackSafe$More.advance(StackSafe.scala:64)
at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
at is.hail.expr.ir.IRParser$.$anonfun$parse_matrix_ir$1(Parser.scala:1980)
at is.hail.expr.ir.IRParser$.parse(Parser.scala:1967)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1980)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$2(SparkBackend.scala:653)
at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:627)
at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:627)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:46)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$1(SparkBackend.scala:652)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
at is.hail.backend.spark.SparkBackend.parse_matrix_ir(SparkBackend.scala:651)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

Hail version: 0.2.74-0c3a74d12093

I’ve also tried following the solution suggested here, using:

spark-shell --conf spark.sql.warehouse.dir=file:///c:/tmp/spark-warehouse

but it doesn’t change the error message.

OK, so the issue here is that Spark expects URIs, not file paths. In a URI, everything before the colon is the scheme. Spark thinks you’re using some scheme/protocol called /run/user/110911/gvfs/smb-share, and that within that scheme you want a file called server=usfc,share=03247-G_1.recalibrated.haplotypeCalls.annotated.vcf. Because this file name does not begin with a /, it is a relative path, hence the “Relative path in absolute URI” error.
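
You can see the same split using Python’s urlparse (Hadoop uses java.net.URI internally, but both follow the same RFC 3986 grammar). Running it on the failing component from your traceback:

from urllib.parse import urlparse

parsed = urlparse('smb-share:server=usfc,share=03247-G_1.recalibrated.haplotypeCalls.annotated.vcf')
print(parsed.scheme)  # 'smb-share': everything before the first colon
print(parsed.path)    # 'server=usfc,share=...': no leading /, hence a relative path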

In general, in Hadoop, if you want to use a local file path, you have to explicitly include the file: scheme. In your case, I think the correct URI is:

file:/run/user/110911/gvfs/smb-share:server=usfc,share=03247-G_1.recalibrated.haplotypeCalls.annotated.vcf
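
In other words (untested on my end, but this is the shape of the call):

mt = hl.import_vcf('file:/run/user/110911/gvfs/smb-share:server=usfc,share=03247-G_1.recalibrated.haplotypeCalls.annotated.vcf')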

Thanks. Adding file:/// didn’t change the error message, so I tried a couple of solutions to have the terminal do the URI encoding for me.

e.g. the returned string now percent-escapes the problematic characters:
filename = 'file:///run/user/110911/gvfs/smb-share%3Aserver%3Dusfc…vcf'

The new problem is that the filename doesn’t seem to exist. Any other suggestions, or maybe a similar use case you can point me to? I’m guessing someone else must be reading VCFs from a non-local location.

FatalError: HailException: arguments refer to no files

Java stack trace:
is.hail.utils.HailException: arguments refer to no files
at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:11)
at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:11)
at is.hail.utils.package$.fatal(package.scala:77)
at is.hail.io.vcf.LoadVCF$.globAllVCFs(LoadVCF.scala:1143)
at is.hail.io.vcf.MatrixVCFReader$.apply(LoadVCF.scala:1572)
at is.hail.io.vcf.MatrixVCFReader$.fromJValue(LoadVCF.scala:1667)
at is.hail.expr.ir.MatrixReader$.fromJson(MatrixIR.scala:92)
at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1714)
at is.hail.expr.ir.IRParser$.$anonfun$matrix_ir$1(Parser.scala:1640)
at is.hail.utils.StackSafe$More.advance(StackSafe.scala:64)
at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
at is.hail.expr.ir.IRParser$.$anonfun$parse_matrix_ir$1(Parser.scala:1980)
at is.hail.expr.ir.IRParser$.parse(Parser.scala:1967)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1980)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$2(SparkBackend.scala:653)
at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:627)
at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:627)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:46)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$1(SparkBackend.scala:652)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
at is.hail.backend.spark.SparkBackend.parse_matrix_ir(SparkBackend.scala:651)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

Hmm. I highly doubt percent-encoding the filename will work because the SMB server is probably not aware of percent encoding. Percent encoding is primarily used by HTTP servers.

I believe the issue is that Hadoop has a non-conforming implementation of RFC 3986 URIs. Section 4.2 specifically notes that colons may appear in the path and that special syntax is needed for relative paths (which you do not have). In particular, there is a Stack Overflow post which details four different Hadoop bugs, none of which is resolved. The root issue has been open since 2008, so I highly doubt Hadoop will ever fix it. I empathize with your frustration; this is a glaring and annoying limitation in Hadoop.

I think you should create a symlink, whose name includes no colons, to the desired file. Then, try to load the data through that symlink.


Also, just to be abundantly clear, I think it’s actually rather important that you use file:/run... not file:///run... The URI syntax requires that the “host” appear between the first // and the path (in your case, /run/...). The file protocol has no notion of “host”, so it is inappropriate to include an empty host as in ///run....

That worked! Thanks. For reference, the syntax is:

ln -s target_file link_name
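
Concretely, something like this, where /home/me/sample.vcf is just a stand-in for any colon-free location:

ln -s '/run/user/110911/gvfs/smb-share:server=usfc,share=03247-G_1.recalibrated.haplotypeCalls.annotated.vcf' /home/me/sample.vcf

and then, per the advice above:

mt = hl.import_vcf('file:/home/me/sample.vcf')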