I am trying to access the Pan UKBB data (Hail Format | Pan UKBB). I have followed their steps, but I get an error when Hail tries to access the S3 data. I have googled the error without finding a solution, and the Pan UKBB team redirected me to this site.
I would like to access the data from my local machine. So far I have been able to read files from this S3 bucket with boto3, so the issue is likely in the Hail configuration. This is my code:
>>> from ukbb_pan_ancestry import *
>>> import hail as hl
>>> ht_idx = hl.read_table('s3://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht')
Initializing Hail with default parameters...
2021-06-22 08:59:05 WARN Utils:69 - Your hostname, ws112610 resolves to a loopback address: 127.0.1.1; using 172.22.3.213 instead (on interface enp0s25)
2021-06-22 08:59:05 WARN Utils:69 - Set SPARK_LOCAL_IP if you need to bind to another address
2021-06-22 08:59:06 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-06-22 08:59:06 WARN Hail:43 - This Hail JAR was compiled for Spark 3.1.1, running with Spark 3.1.2.
Compatibility is not guaranteed.
Running on Apache Spark version 3.1.2
SparkUI available at http://ws112610.cm.upf.edu:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.69-6d2bd28a8849
LOGGING: writing to /home/SHARED/PROJECTS/Obesity_analysis/hail-20210622-0859-0.2.69-6d2bd28a8849.log
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<decorator-gen-1429>", line 2, in read_table
File "/home/carlos/.local/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
return __original_func(*args_, **kwargs_)
File "/home/carlos/.local/lib/python3.8/site-packages/hail/methods/impex.py", line 2457, in read_table
for rg_config in Env.backend().load_references_from_dataset(path):
File "/home/carlos/.local/lib/python3.8/site-packages/hail/backend/spark_backend.py", line 326, in load_references_from_dataset
return json.loads(Env.hail().variant.ReferenceGenome.fromHailDataset(self.fs._jfs, path))
File "/home/carlos/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/home/carlos/.local/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 30, in deco
raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: UnsupportedFileSystemException: No FileSystem for scheme "s3"
Java stack trace:
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at is.hail.io.fs.HadoopFS.fileStatus(HadoopFS.scala:164)
at is.hail.io.fs.FS.isDir(FS.scala:175)
at is.hail.io.fs.FS.isDir$(FS.scala:173)
at is.hail.io.fs.HadoopFS.isDir(HadoopFS.scala:70)
at is.hail.expr.ir.RelationalSpec$.readMetadata(AbstractMatrixTableSpec.scala:30)
at is.hail.expr.ir.RelationalSpec$.readReferences(AbstractMatrixTableSpec.scala:68)
at is.hail.variant.ReferenceGenome$.fromHailDataset(ReferenceGenome.scala:596)
at is.hail.variant.ReferenceGenome.fromHailDataset(ReferenceGenome.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Hail version: 0.2.69-6d2bd28a8849
Error summary: UnsupportedFileSystemException: No FileSystem for scheme "s3"
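For a local (non-EMR) Spark install, the plain "s3" scheme is generally not available in stock Hadoop 3.x; the usual fix is to put the hadoop-aws connector on the classpath and address the data with the s3a:// scheme. A minimal sketch, not the official Pan UKBB setup; the connector version below is an assumption and must match the Hadoop build bundled with your Spark distribution (Spark 3.1.x bundles Hadoop 3.2.0):

import hail as hl

# Sketch: pull in the S3A connector before the SparkContext is created and
# read the table via s3a:// instead of s3://.
# NOTE: 'hadoop-aws:3.2.0' is an assumption -- match it to your Hadoop version.
hl.init(spark_conf={
    'spark.jars.packages': 'org.apache.hadoop:hadoop-aws:3.2.0',
})

ht_idx = hl.read_table('s3a://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht')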
Thank you for your help! Now I can connect to the S3 bucket, but I still cannot access the data.
The bucket I am trying to access is a public bucket. If I run:
aws s3 ls s3://pan-ukb-us-east-1/ --no-sign-request
I can successfully list the directory. However, Python is asking me for AWS configuration. I created a dummy AWS config file, but it does not help; I am still getting an error:
>>> ht_idx = hl.read_table('s3a://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<decorator-gen-1429>", line 2, in read_table
File "/home/carlos/.local/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
return __original_func(*args_, **kwargs_)
File "/home/carlos/.local/lib/python3.8/site-packages/hail/methods/impex.py", line 2457, in read_table
for rg_config in Env.backend().load_references_from_dataset(path):
File "/home/carlos/.local/lib/python3.8/site-packages/hail/backend/spark_backend.py", line 326, in load_references_from_dataset
return json.loads(Env.hail().variant.ReferenceGenome.fromHailDataset(self.fs._jfs, path))
File "/home/carlos/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/home/carlos/.local/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 30, in deco
raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: B6B484W5S4YYFKW6; S3 Extended Request ID: x8+lGQjxzYpHyNwUyahr2rQ9KZ
6I9No2Xqwgrl3tmfm6jfIooSk8+URJSV9e42koEu0btG0Co/g=)
Java stack trace:
java.nio.file.AccessDeniedException: s3a://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht: getFileStatus on s3a://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht: com.amazonaws.services.s3
.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: B6B484W5S4YYFKW6; S3 Extended Request ID: x8+lGQjxzYpHyNwUyahr2rQ9KZ6I9No2Xqwgrl3tmfm6jfI
ooSk8+URJSV9e42koEu0btG0Co/g=), S3 Extended Request ID: x8+lGQjxzYpHyNwUyahr2rQ9KZ6I9No2Xqwgrl3tmfm6jfIooSk8+URJSV9e42koEu0btG0Co/g=:403 Forbidden
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1640)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4368)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4315)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1271)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$4(S3AFileSystem.java:1249)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:285)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:1246)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2183)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2163)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2102)
at is.hail.io.fs.HadoopFS.fileStatus(HadoopFS.scala:164)
at is.hail.io.fs.FS.isDir(FS.scala:175)
at is.hail.io.fs.FS.isDir$(FS.scala:173)
at is.hail.io.fs.HadoopFS.isDir(HadoopFS.scala:70)
at is.hail.expr.ir.RelationalSpec$.readMetadata(AbstractMatrixTableSpec.scala:30)
at is.hail.expr.ir.RelationalSpec$.readReferences(AbstractMatrixTableSpec.scala:68)
at is.hail.variant.ReferenceGenome$.fromHailDataset(ReferenceGenome.scala:596)
at is.hail.variant.ReferenceGenome.fromHailDataset(ReferenceGenome.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Hail version: 0.2.69-6d2bd28a8849
Error summary: AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: B6B484W5S4YYFKW6; S3 Extended Request ID: x8+lGQjxzYpHyNwUyahr2rQ9KZ6I9No2Xqwgrl3tmfm6jfIooSk8+URJSV9e42koEu0btG0Co/g=)
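For reference, anonymous access to the bucket also works from Python outside of Hail. A minimal boto3 sketch using botocore's unsigned requests (the equivalent of --no-sign-request; the Prefix and MaxKeys values are just for illustration):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous listing of the public bucket -- the boto3 equivalent of
# `aws s3 ls s3://pan-ukb-us-east-1/ --no-sign-request`.
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket='pan-ukb-us-east-1', Prefix='ld_release/', MaxKeys=5)
for obj in resp.get('Contents', []):
    print(obj['Key'])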
No. If I use the s3:// scheme, I get the same error as in my previous message:
>>> ht_idx = hl.read_table('s3://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht')
Initializing Hail with default parameters...
2021-06-23 08:59:30 WARN Utils:69 - Your hostname, ws112610 resolves to a loopback address: 127.0.1.1; using 172.22.3.213 instead (on interface enp0s25)
2021-06-23 08:59:30 WARN Utils:69 - Set SPARK_LOCAL_IP if you need to bind to another address
2021-06-23 08:59:31 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-06-23 08:59:31 WARN Hail:43 - This Hail JAR was compiled for Spark 3.1.1, running with Spark 3.1.2.
Compatibility is not guaranteed.
Running on Apache Spark version 3.1.2
SparkUI available at http://ws112610.cm.upf.edu:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.69-6d2bd28a8849
LOGGING: writing to /home/carlos/hail-20210623-0859-0.2.69-6d2bd28a8849.log
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<decorator-gen-1429>", line 2, in read_table
File "/home/carlos/.local/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
return __original_func(*args_, **kwargs_)
File "/home/carlos/.local/lib/python3.8/site-packages/hail/methods/impex.py", line 2457, in read_table
for rg_config in Env.backend().load_references_from_dataset(path):
File "/home/carlos/.local/lib/python3.8/site-packages/hail/backend/spark_backend.py", line 326, in load_references_from_dataset
return json.loads(Env.hail().variant.ReferenceGenome.fromHailDataset(self.fs._jfs, path))
File "/home/carlos/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/home/carlos/.local/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 30, in deco
raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: UnsupportedFileSystemException: No FileSystem for scheme "s3"
You might try removing everything from spark.hadoop.fs.s3a.aws.credentials.provider except for org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider.
Remove all but org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider from the Spark config.
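For example, a minimal sketch of setting that provider at initialization time (assuming you call hl.init yourself rather than relying on the default initialization; the same property can also live in spark-defaults.conf as spark.hadoop.fs.s3a.aws.credentials.provider):

import hail as hl

# Restrict the S3A credentials chain to anonymous access, since
# pan-ukb-us-east-1 is a public bucket.
hl.init(spark_conf={
    'spark.hadoop.fs.s3a.aws.credentials.provider':
        'org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider',
})

ht_idx = hl.read_table('s3a://pan-ukb-us-east-1/ld_release/UKBB.EUR.ldadj.variant.ht')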
Both options seem to solve the problem, and now I can access the S3 bucket with Hail. I hope the issue is solved for good; if I find another problem, I will let you know.
I am getting this error:
Error summary: IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
with the stack trace:
Java stack trace:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:70)
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:93)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy13.initialize(Unknown Source)
at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at is.hail.io.fs.HadoopFS.getFileSystem(HadoopFS.scala:100)
at is.hail.io.fs.HadoopFS.glob(HadoopFS.scala:154)
at is.hail.io.fs.HadoopFS.$anonfun$globAll$1(HadoopFS.scala:136)
at is.hail.io.fs.HadoopFS.$anonfun$globAll$1$adapted(HadoopFS.scala:135)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at is.hail.io.fs.HadoopFS.globAll(HadoopFS.scala:141)
at is.hail.io.vcf.MatrixVCFReader$.apply(LoadVCF.scala:1570)
at is.hail.io.vcf.MatrixVCFReader$.fromJValue(LoadVCF.scala:1666)
at is.hail.expr.ir.MatrixReader$.fromJson(MatrixIR.scala:89)
at is.hail.expr.ir.IRParser$.matrix_ir_1(Parser.scala:1720)
at is.hail.expr.ir.IRParser$.$anonfun$matrix_ir$1(Parser.scala:1646)
at is.hail.utils.StackSafe$More.advance(StackSafe.scala:64)
at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
at is.hail.expr.ir.IRParser$.$anonfun$parse_matrix_ir$1(Parser.scala:1986)
at is.hail.expr.ir.IRParser$.parse(Parser.scala:1973)
at is.hail.expr.ir.IRParser$.parse_matrix_ir(Parser.scala:1986)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$2(SparkBackend.scala:689)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:69)
at is.hail.utils.package$.using(package.scala:638)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:69)
at is.hail.utils.package$.using(package.scala:638)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:58)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:308)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_matrix_ir$1(SparkBackend.scala:688)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
at is.hail.backend.spark.SparkBackend.parse_matrix_ir(SparkBackend.scala:687)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
How can we provide the AWS Access Key ID and Secret Access Key to Hail?
You should not install the S3A connector on EMR; Amazon EMR is already configured to work with S3. I suggest looking at Amazon's documentation for running Hail on AWS.
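If you are not on EMR and really do need to pass explicit credentials, a hedged sketch: the S3A connector accepts them through the Spark Hadoop configuration, whereas the fs.s3.* properties named in the error belong to the legacy Jets3t-based "s3" filesystem. The bucket path and placeholder key values below are purely illustrative.

import hail as hl

# Sketch for a non-EMR Spark install: hand credentials to the S3A connector
# and use s3a:// paths. The placeholder values are not real credentials, and
# the VCF path is hypothetical.
hl.init(spark_conf={
    'spark.hadoop.fs.s3a.access.key': 'YOUR_ACCESS_KEY_ID',
    'spark.hadoop.fs.s3a.secret.key': 'YOUR_SECRET_ACCESS_KEY',
})

mt = hl.import_vcf('s3a://your-bucket/path/to/data.vcf.bgz', reference_genome='GRCh38')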