I am trying to access the data from the Pan UKBB release (Hail Format | Pan UKBB). I have followed their steps, but I get an error when Hail tries to access the S3 data. I have googled the error but have not found a solution, and the Pan UKBB team redirected me to this site.
I would like to access the data from my local machine. Until now I have been able to access files from this S3 bucket with boto3, so the issue is likely related to the Hail configuration. This is my code:
>>> from ukbb_pan_ancestry import *
>>> import hail as hl
>>> ht_idx = hl.read_table('s3://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht')
Initializing Hail with default parameters...
2021-06-22 08:59:05 WARN Utils:69 - Your hostname, ws112610 resolves to a loopback address: 127.0.1.1; using 172.22.3.213 instead (on interface enp0s25)
2021-06-22 08:59:05 WARN Utils:69 - Set SPARK_LOCAL_IP if you need to bind to another address
2021-06-22 08:59:06 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-06-22 08:59:06 WARN Hail:43 - This Hail JAR was compiled for Spark 3.1.1, running with Spark 3.1.2.
Compatibility is not guaranteed.
Running on Apache Spark version 3.1.2
SparkUI available at http://ws112610.cm.upf.edu:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.69-6d2bd28a8849
LOGGING: writing to /home/SHARED/PROJECTS/Obesity_analysis/hail-20210622-0859-0.2.69-6d2bd28a8849.log
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<decorator-gen-1429>", line 2, in read_table
File "/home/carlos/.local/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
return __original_func(*args_, **kwargs_)
File "/home/carlos/.local/lib/python3.8/site-packages/hail/methods/impex.py", line 2457, in read_table
for rg_config in Env.backend().load_references_from_dataset(path):
File "/home/carlos/.local/lib/python3.8/site-packages/hail/backend/spark_backend.py", line 326, in load_references_from_dataset
return json.loads(Env.hail().variant.ReferenceGenome.fromHailDataset(self.fs._jfs, path))
File "/home/carlos/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/home/carlos/.local/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 30, in deco
raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: UnsupportedFileSystemException: No FileSystem for scheme "s3"
Java stack trace:
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at is.hail.io.fs.HadoopFS.fileStatus(HadoopFS.scala:164)
at is.hail.io.fs.FS.isDir(FS.scala:175)
at is.hail.io.fs.FS.isDir$(FS.scala:173)
at is.hail.io.fs.HadoopFS.isDir(HadoopFS.scala:70)
at is.hail.expr.ir.RelationalSpec$.readMetadata(AbstractMatrixTableSpec.scala:30)
at is.hail.expr.ir.RelationalSpec$.readReferences(AbstractMatrixTableSpec.scala:68)
at is.hail.variant.ReferenceGenome$.fromHailDataset(ReferenceGenome.scala:596)
at is.hail.variant.ReferenceGenome.fromHailDataset(ReferenceGenome.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Hail version: 0.2.69-6d2bd28a8849
Error summary: UnsupportedFileSystemException: No FileSystem for scheme "s3"
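For reference, something like the following works with boto3 (a sketch, assuming anonymous, unsigned requests to the public bucket; the prefix is only an example):
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) S3 client, equivalent to `aws s3 ls ... --no-sign-request`
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# List a few objects under a prefix of the public bucket
resp = s3.list_objects_v2(Bucket='pan-ukb-us-east-1', Prefix='ld_release/')
for obj in resp.get('Contents', [])[:5]:
    print(obj['Key'], obj['Size'])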
Thank you for your help! Now I am able to connect to the S3 bucket, but I still cannot access the data.
The bucket I am trying to access is public. If I run:
aws s3 ls s3://pan-ukb-us-east-1/ --no-sign-request
I can list the directory successfully. However, from Python it asks me for some AWS configuration. I created a dummy AWS config file, but it is not working; I am still getting an error:
>>> ht_idx = hl.read_table('s3a://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<decorator-gen-1429>", line 2, in read_table
File "/home/carlos/.local/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
return __original_func(*args_, **kwargs_)
File "/home/carlos/.local/lib/python3.8/site-packages/hail/methods/impex.py", line 2457, in read_table
for rg_config in Env.backend().load_references_from_dataset(path):
File "/home/carlos/.local/lib/python3.8/site-packages/hail/backend/spark_backend.py", line 326, in load_references_from_dataset
return json.loads(Env.hail().variant.ReferenceGenome.fromHailDataset(self.fs._jfs, path))
File "/home/carlos/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/home/carlos/.local/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 30, in deco
raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: B6B484W5S4YYFKW6; S3 Extended Request ID: x8+lGQjxzYpHyNwUyahr2rQ9KZ6I9No2Xqwgrl3tmfm6jfIooSk8+URJSV9e42koEu0btG0Co/g=)
Java stack trace:
java.nio.file.AccessDeniedException: s3a://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht: getFileStatus on s3a://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: B6B484W5S4YYFKW6; S3 Extended Request ID: x8+lGQjxzYpHyNwUyahr2rQ9KZ6I9No2Xqwgrl3tmfm6jfIooSk8+URJSV9e42koEu0btG0Co/g=), S3 Extended Request ID: x8+lGQjxzYpHyNwUyahr2rQ9KZ6I9No2Xqwgrl3tmfm6jfIooSk8+URJSV9e42koEu0btG0Co/g=:403 Forbidden
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1640)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4368)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4315)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1271)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$4(S3AFileSystem.java:1249)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:1246)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2183)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2163)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2102)
at is.hail.io.fs.HadoopFS.fileStatus(HadoopFS.scala:164)
at is.hail.io.fs.FS.isDir(FS.scala:175)
at is.hail.io.fs.FS.isDir$(FS.scala:173)
at is.hail.io.fs.HadoopFS.isDir(HadoopFS.scala:70)
at is.hail.expr.ir.RelationalSpec$.readMetadata(AbstractMatrixTableSpec.scala:30)
at is.hail.expr.ir.RelationalSpec$.readReferences(AbstractMatrixTableSpec.scala:68)
at is.hail.variant.ReferenceGenome$.fromHailDataset(ReferenceGenome.scala:596)
at is.hail.variant.ReferenceGenome.fromHailDataset(ReferenceGenome.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Hail version: 0.2.69-6d2bd28a8849
Error summary: AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: B6B484W5S4YYFKW6; S3 Extended Request ID: x8+lGQjxzYpHyNwUyahr2rQ9KZ6I9No2Xqwgrl3tmfm6jfIooSk8+URJSV9e42koEu0btG0Co/g=)
No. If I use the s3:// protocol, I get the same error as in my first message:
>>> ht_idx = hl.read_table('s3://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht')
Initializing Hail with default parameters...
2021-06-23 08:59:30 WARN Utils:69 - Your hostname, ws112610 resolves to a loopback address: 127.0.1.1; using 172.22.3.213 instead (on interface enp0s25)
2021-06-23 08:59:30 WARN Utils:69 - Set SPARK_LOCAL_IP if you need to bind to another address
2021-06-23 08:59:31 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-06-23 08:59:31 WARN Hail:43 - This Hail JAR was compiled for Spark 3.1.1, running with Spark 3.1.2.
Compatibility is not guaranteed.
Running on Apache Spark version 3.1.2
SparkUI available at http://ws112610.cm.upf.edu:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.69-6d2bd28a8849
LOGGING: writing to /home/carlos/hail-20210623-0859-0.2.69-6d2bd28a8849.log
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<decorator-gen-1429>", line 2, in read_table
File "/home/carlos/.local/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
return __original_func(*args_, **kwargs_)
File "/home/carlos/.local/lib/python3.8/site-packages/hail/methods/impex.py", line 2457, in read_table
for rg_config in Env.backend().load_references_from_dataset(path):
File "/home/carlos/.local/lib/python3.8/site-packages/hail/backend/spark_backend.py", line 326, in load_references_from_dataset
return json.loads(Env.hail().variant.ReferenceGenome.fromHailDataset(self.fs._jfs, path))
File "/home/carlos/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/home/carlos/.local/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 30, in deco
raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: UnsupportedFileSystemException: No FileSystem for scheme "s3"
You might try removing everything from the spark.hadoop.fs.s3a.aws.credentials.provider setting except for org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider.
Remove all but org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider from the Spark config.
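For example, something along these lines when initializing Hail (a sketch, not the exact command used in this thread; the hadoop-aws version is an assumption and must match the Hadoop build your Spark uses):
import hail as hl

hl.init(spark_conf={
    # Pull in the s3a connector; this version is a guess and must match
    # the Hadoop version bundled with your Spark installation.
    'spark.jars.packages': 'org.apache.hadoop:hadoop-aws:3.2.0',
    # Keep only the anonymous provider so requests to the public bucket
    # are not signed with (dummy) credentials.
    'spark.hadoop.fs.s3a.aws.credentials.provider':
        'org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider',
})

# Note the s3a:// scheme rather than s3://
ht_idx = hl.read_table('s3a://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht')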
Both options seem to solve the problem, and now I can access the S3 bucket with Hail. I hope the issue is now solved; if I find another problem, I will let you know.
I am seeing a related error when Hail reads from S3 with the s3:// scheme:
Error summary: IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
with the following stack trace:
Java stack trace:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:70)
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:93)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy13.initialize(Unknown Source)
at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at is.hail.io.fs.HadoopFS.getFileSystem(HadoopFS.scala:100)
at is.hail.io.fs.HadoopFS.glob(HadoopFS.scala:154)
at is.hail.io.fs.HadoopFS.$anonfun$globAll$1(HadoopFS.scala:136)
at is.hail.io.fs.HadoopFS.$anonfun$globAll$1$adapted(HadoopFS.scala:135)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at is.hail.io.fs.HadoopFS.globAll(HadoopFS.scala:141)
at is.hail.io.vcf.MatrixVCFReader$.apply(LoadVCF.scala:1570)
at is.hail.io.vcf.MatrixVCFReader$.fromJValue(LoadVCF.scala:1666)
at is.hail.expr.ir.MatrixReader$.fromJson(MatrixIR.scala:89)
at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1720)
at is.hail.expr.ir.IRParser$.$anonfun$matrix_ir$1(Parser.scala:1646)
at is.hail.utils.StackSafe$More.advance(StackSafe.scala:64)
at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
at is.hail.expr.ir.IRParser$.$anonfun$parse_matrix_ir$1(Parser.scala:1986)
at is.hail.expr.ir.IRParser$.parse(Parser.scala:1973)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1986)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$2(SparkBackend.scala:689)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:69)
at is.hail.utils.package$.using(package.scala:638)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:69)
at is.hail.utils.package$.using(package.scala:638)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:58)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:308)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$1(SparkBackend.scala:688)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
at is.hail.backend.spark.SparkBackend.parse_matrix_ir(SparkBackend.scala:687)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
How can we provide the AWS Access Key ID and Secret Access Key to Hail?
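In case it helps, one way to pass explicit credentials is through the s3a Hadoop properties when initializing Hail, roughly like this (a sketch; the key values and the path are placeholders, and it assumes the hadoop-aws / s3a connector is on the classpath):
import hail as hl

hl.init(spark_conf={
    # Placeholder credentials; do not hard-code real secrets in scripts.
    'spark.hadoop.fs.s3a.access.key': 'YOUR_AWS_ACCESS_KEY_ID',
    'spark.hadoop.fs.s3a.secret.key': 'YOUR_AWS_SECRET_ACCESS_KEY',
})

# Use s3a:// paths; the legacy s3:// scheme in the stack trace above goes
# through the old Jets3t-based S3FileSystem, which instead expects the
# fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey properties.
mt = hl.import_vcf('s3a://your-bucket/path/to/file.vcf.bgz')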