I’m trying to load VCF files from an Isilon server (since I don’t have TB of storage on my laptop).
The problem seems to be that Hadoop doesn’t like the : character in my SMB relative file path. I can open the file at this path as a file object in Python, so I know the path is good.
FatalError: URISyntaxException: Relative path in absolute URI: smb-share:server=usfc
Java stack trace:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: smb-share:server=usfc,
at org.apache.hadoop.fs.Path.initialize(Path.java:259)
at org.apache.hadoop.fs.Path.<init>(Path.java:217)
at org.apache.hadoop.fs.Path.<init>(Path.java:125)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2016)
at is.hail.io.fs.HadoopFS.glob(HadoopFS.scala:155)
at is.hail.io.fs.HadoopFS.$anonfun$globAll$1(HadoopFS.scala:134)
at is.hail.io.fs.HadoopFS.$anonfun$globAll$1$adapted(HadoopFS.scala:133)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at is.hail.io.fs.HadoopFS.globAll(HadoopFS.scala:139)
at is.hail.io.vcf.MatrixVCFReader$.apply(LoadVCF.scala:1572)
at is.hail.io.vcf.MatrixVCFReader$.fromJValue(LoadVCF.scala:1667)
at is.hail.expr.ir.MatrixReader$.fromJson(MatrixIR.scala:92)
at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1714)
at is.hail.expr.ir.IRParser$.$anonfun$matrix_ir$1(Parser.scala:1640)
at is.hail.utils.StackSafe$More.advance(StackSafe.scala:64)
at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
at is.hail.expr.ir.IRParser$.$anonfun$parse_matrix_ir$1(Parser.scala:1980)
at is.hail.expr.ir.IRParser$.parse(Parser.scala:1967)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1980)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$2(SparkBackend.scala:653)
at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:627)
at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:627)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:46)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$1(SparkBackend.scala:652)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
at is.hail.backend.spark.SparkBackend.parse_matrix_ir(SparkBackend.scala:651)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
OK, so the issue here is that Spark expects URIs, not file paths. In a URI, everything before the first colon is the scheme. Spark thinks you’re using some scheme/protocol called /run/user/110911/gvfs/smb-share and that within that scheme you want a file called server=usfc,share=03247-G_1.recalibrated.haplotypeCalls.annotated.vcf. Because this file name does not begin with a /, it is treated as a relative path.
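To see the scheme-splitting rule in isolation, here is a minimal Python sketch using the standard library's urllib.parse.urlsplit. It is only an analogy: Hadoop has its own path parser, and the share name below is a made-up placeholder, not the real path.

```python
from urllib.parse import urlsplit

# Hypothetical path component; the real share name is not reproduced here.
component = "smb-share:server=usfc,share=example"

parts = urlsplit(component)
print(parts.scheme)  # the text before the first colon is taken as the scheme
print(parts.path)    # the remainder has no leading "/", so it reads as relative

# With an explicit file: scheme, the path after the colon stays absolute.
local = urlsplit("file:/run/user/110911/gvfs/example.vcf")
print(local.scheme, local.path)
```

The point is only that a colon early in a string changes how a URI parser divides it, which matches the "Relative path in absolute URI" message.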
In general, in Hadoop, if you want to use a local file path, you have to explicitly include the file: scheme. In your case, I think the correct URI is:
Thanks. Adding file:/// didn’t change the error message, so I tried both of these solutions below to make the terminal do the URI encoding for me.
For example, the returned string now escapes some of the problematic characters:
filename='file:///run/user/110911/gvfs/smb-share%3Aserver%3Dusfc…vcf'
The new problem is that the filename no longer seems to exist. Any other suggestions, or maybe a similar use case you can point me to? I’m guessing someone else must be reading VCFs from a non-local location.
FatalError: HailException: arguments refer to no files
Java stack trace:
is.hail.utils.HailException: arguments refer to no files
at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:11)
at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:11)
at is.hail.utils.package$.fatal(package.scala:77)
at is.hail.io.vcf.LoadVCF$.globAllVCFs(LoadVCF.scala:1143)
at is.hail.io.vcf.MatrixVCFReader$.apply(LoadVCF.scala:1572)
at is.hail.io.vcf.MatrixVCFReader$.fromJValue(LoadVCF.scala:1667)
at is.hail.expr.ir.MatrixReader$.fromJson(MatrixIR.scala:92)
at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1714)
at is.hail.expr.ir.IRParser$.$anonfun$matrix_ir$1(Parser.scala:1640)
at is.hail.utils.StackSafe$More.advance(StackSafe.scala:64)
at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
at is.hail.expr.ir.IRParser$.$anonfun$parse_matrix_ir$1(Parser.scala:1980)
at is.hail.expr.ir.IRParser$.parse(Parser.scala:1967)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1980)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$2(SparkBackend.scala:653)
at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:627)
at is.hail.expr.ir.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:47)
at is.hail.utils.package$.using(package.scala:627)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:46)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:275)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$1(SparkBackend.scala:652)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
at is.hail.backend.spark.SparkBackend.parse_matrix_ir(SparkBackend.scala:651)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Hmm. I highly doubt percent-encoding the filename will work because the SMB server is probably not aware of percent encoding. Percent encoding is primarily used by HTTP servers.
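A small sketch of why the encoded name is not found, assuming a POSIX filesystem where colons are legal in file names. The directory and file here are temporary stand-ins, and percent_encoded_exists is a hypothetical helper, not part of Hail or Hadoop.

```python
import os
import tempfile
from urllib.parse import quote


def percent_encoded_exists(directory, name):
    """Return (encoded_name, exists_on_disk) for `name` inside `directory`."""
    encoded = quote(name)
    return encoded, os.path.exists(os.path.join(directory, encoded))


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the GVFS mount name, which contains ":" and "=".
    name = "smb-share:server=usfc"
    open(os.path.join(tmp, name), "w").close()

    encoded, exists = percent_encoded_exists(tmp, name)
    print(encoded)  # the colon and equals sign are replaced by %XX escapes
    print(exists)   # the filesystem has no entry under the escaped name
```

Nothing on the filesystem side decodes the %XX escapes, so the encoded string simply names a different, nonexistent file.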
I believe the issue is that Hadoop has a non-conforming implementation of RFC 3986 URIs. Section 4.2 specifically notes that colons may appear in the path and that special syntax is needed only for relative paths (which you do not have). In particular, there is a StackOverflow post which details four different Hadoop bugs, none of which has been resolved. The root issue has been open since 2008, so I highly doubt Hadoop will ever fix it. I empathize with your frustration; this is a glaring and annoying limitation in Hadoop.
I think you should create a symlink, whose name includes no colons, to the desired file. Then, try to load the data through that symlink.
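Here is a rough sketch of the symlink idea in Python, assuming a POSIX system. The helper name colon_free_link and the temporary directory layout are illustrative only; substitute your real mount path and a link location you can write to.

```python
import os
import tempfile


def colon_free_link(target, link_dir, link_name):
    """Symlink `target` under a colon-free path and return a file: URI for it."""
    link_path = os.path.join(os.path.abspath(link_dir), link_name)
    if ":" in link_path:
        raise ValueError(f"link path still contains a colon: {link_path}")
    os.symlink(os.path.abspath(target), link_path)
    return "file:" + link_path


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the GVFS mount directory, whose name contains a colon.
    mount = os.path.join(tmp, "smb-share:server=usfc")
    os.mkdir(mount)
    vcf = os.path.join(mount, "example.vcf")
    open(vcf, "w").close()

    uri = colon_free_link(vcf, tmp, "example.vcf")
    # The link and the original resolve to the same file.
    same = os.path.samefile(uri[len("file:"):], vcf)
    print(uri, same)
```

You would then pass the returned string to hl.import_vcf. I have not tested this against an SMB mount, so treat it as a starting point.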
Also, just to be abundantly clear, I think it’s actually rather important that you use file:/run... not file:///run... The URI syntax requires that the “host” appear between the first // and the path (in your case, /run/...). The file protocol has no notion of “host”, so it is inappropriate to include an empty host as in ///run....