Annotate variants with genes

Hello team,

I need to annotate each variant in a MatrixTable with a gene symbol and gene interval for GRCh38. Is there a Hail function that maps variants to genes based on the variant locus?

Thanks very much for your help!

There is no such function. Your options:

  1. Download a GRCh38 locus-to-gene mapping file, import it as a table, and use that to annotate your matrix table (see the sketch after this list).
  2. Run hl.vep on a Dataproc cluster.
  3. Use the Hail Annotation Database to annotate with gencode, which includes gene names. Note that the Annotation Database is only available on Amazon EMR and Google Dataproc.
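
For option 1, a minimal sketch of the annotation join, assuming a hypothetical tab-separated file with an interval column like chr1:11869-14409 and a gene_symbol column (the file path and column names are placeholders):

import hail as hl

# Import the hypothetical mapping file and key it by a GRCh38 interval.
genes = hl.import_table('gene_intervals.tsv')
genes = genes.annotate(
    interval=hl.parse_locus_interval(genes.interval, reference_genome='GRCh38'))
genes = genes.key_by('interval')

# An interval-keyed table can be queried with a locus: each variant gets
# the gene whose interval contains its locus (missing if none does).
mt = mt.annotate_rows(gene_symbol=genes[mt.locus].gene_symbol)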

@danking Thank you for the solution. I used option 3 as you suggested and annotated the variants with genes using the gencode database.
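
In outline, the annotation step looks like this (a sketch; the exact field names in the gencode table may differ by version):

# gencode is keyed by interval, so it can be queried with a variant locus.
mt = mt.annotate_rows(gene_name=gencode[mt.locus].gene_name)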

Some of the variants in my MatrixTable are not mapped to any gene, since their loci fall outside all gene intervals. We would like to find the closest gene for each of these unmapped variants. Is there a function in Hail for finding the closest gene based on the variant locus?

Thanks very much for your help.

There is no such function in Hail. If I were going to build that, I would take the ~30,000 genes and (in plain Python or whatever) evenly expand their locus intervals until the entire genome is covered by at least one gene, as in the sketch below. Then I'd import that as a Hail table and use it to annotate my data.
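
A minimal sketch of that expansion in plain Python, assuming per-contig sorted, non-overlapping gene intervals and a dict of contig lengths (all names here are hypothetical):

from collections import defaultdict

def expand_to_cover(genes, contig_lengths):
    # genes: iterable of (contig, start, end, gene_name) tuples.
    # Returns intervals expanded so each contig is fully covered and every
    # position maps to (roughly) its nearest gene.
    by_contig = defaultdict(list)
    for contig, start, end, name in genes:
        by_contig[contig].append((start, end, name))

    expanded = []
    for contig, ivs in by_contig.items():
        ivs.sort()
        for i, (start, end, name) in enumerate(ivs):
            # Left edge: start of the contig for the first gene, otherwise
            # the midpoint of the gap to the previous gene.
            left = 1 if i == 0 else (ivs[i - 1][1] + start) // 2
            # Right edge: end of the contig for the last gene, otherwise
            # the midpoint of the gap to the next gene.
            right = contig_lengths[contig] if i == len(ivs) - 1 else (end + ivs[i + 1][0]) // 2
            expanded.append((contig, left, right, name))
    return expanded

The expanded intervals can then be written out, imported with hl.import_table, keyed with hl.parse_locus_interval as in the earlier sketch, and joined against the variant loci.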

Thanks Dan for the suggestion.

@danking I am using the Hail function below to load the gencode database on the DNAnexus platform.

gencode = hl.experimental.load_dataset(name='gencode',
                                       version='v31',
                                       reference_genome='GRCh38',
                                       region='us',
                                       cloud='aws')

This function was running fine until last week. However, when I ran it last Wednesday, I got the error below. The error seems to be related to an AWS authentication issue with the S3 bucket in the US region (FatalError: ClassNotFoundException: com.amazonaws.auth.AWSCredentialsProvider). I didn't have this error before; it started after I installed the AWS Client Workspaces on my laptop. I tried uninstalling the AWS Client Workspaces and re-running the Hail function above, but I still get the error. Your help in resolving this issue would be greatly appreciated.


FatalError                                Traceback (most recent call last)
<ipython-input-...> in <module>
      3     reference_genome='GRCh38',
      4     region='us',
---> 5     cloud='aws')

/opt/conda/lib/python3.6/site-packages/hail/experimental/datasets.py in load_dataset(name, version, reference_genome, region, cloud)
    119             dataset = _read_dataset(path)
    120         except hl.utils.java.FatalError:
--> 121             dataset = _read_dataset(path.replace('s3://', 's3a://'))
    122     else:
    123         dataset = _read_dataset(path)

/opt/conda/lib/python3.6/site-packages/hail/experimental/datasets.py in _read_dataset(path)
     11         return hl.read_table(path)
     12     elif path.endswith('.mt'):
---> 13         return hl.read_matrix_table(path)
     14     elif path.endswith('.bm'):
     15         return hl.linalg.BlockMatrix.read(path)

<decorator-gen-...> in read_matrix_table(path, _intervals, _filter_intervals, _drop_cols, _drop_rows, _n_partitions)

/opt/conda/lib/python3.6/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/opt/conda/lib/python3.6/site-packages/hail/methods/impex.py in read_matrix_table(path, _intervals, _filter_intervals, _drop_cols, _drop_rows, _n_partitions)
   2113     :class:`.MatrixTable`
   2114     """
-> 2115     for rg_config in Env.backend().load_references_from_dataset(path):
   2116         hl.ReferenceGenome._from_config(rg_config)
   2117 

/opt/conda/lib/python3.6/site-packages/hail/backend/spark_backend.py in load_references_from_dataset(self, path)
    326 
    327     def load_references_from_dataset(self, path):
--> 328         return json.loads(Env.hail().variant.ReferenceGenome.fromHailDataset(self.fs._jfs, path))
    329 
    330     def from_fasta_file(self, name, fasta_file, index_file, x_contigs, y_contigs, mt_contigs, par):

/cluster/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/conda/lib/python3.6/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     29             raise FatalError('%s\n\nJava stack trace:\n%s\n'
     30                              'Hail version: %s\n'
---> 31                              'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     32     except pyspark.sql.utils.CapturedException as e:
     33         raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: ClassNotFoundException: com.amazonaws.auth.AWSCredentialsProvider

Java stack trace:
java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at is.hail.io.fs.HadoopFS.fileStatus(HadoopFS.scala:164)
at is.hail.io.fs.FS$class.isDir(FS.scala:175)
at is.hail.io.fs.HadoopFS.isDir(HadoopFS.scala:70)
at is.hail.expr.ir.RelationalSpec$.readMetadata(AbstractMatrixTableSpec.scala:30)
at is.hail.expr.ir.RelationalSpec$.readReferences(AbstractMatrixTableSpec.scala:73)
at is.hail.variant.ReferenceGenome$.fromHailDataset(ReferenceGenome.scala:581)
at is.hail.variant.ReferenceGenome.fromHailDataset(ReferenceGenome.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)

java.lang.ClassNotFoundException: com.amazonaws.auth.AWSCredentialsProvider
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at is.hail.io.fs.HadoopFS.fileStatus(HadoopFS.scala:164)
at is.hail.io.fs.FS$class.isDir(FS.scala:175)
at is.hail.io.fs.HadoopFS.isDir(HadoopFS.scala:70)
at is.hail.expr.ir.RelationalSpec$.readMetadata(AbstractMatrixTableSpec.scala:30)
at is.hail.expr.ir.RelationalSpec$.readReferences(AbstractMatrixTableSpec.scala:73)
at is.hail.variant.ReferenceGenome$.fromHailDataset(ReferenceGenome.scala:581)
at is.hail.variant.ReferenceGenome.fromHailDataset(ReferenceGenome.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)

Hail version: 0.2.78-b17627756568
Error summary: ClassNotFoundException: com.amazonaws.auth.AWSCredentialsProvider

If you’re running Hail through DNAnexus, this might be a bug in how they’re configuring Hail / Spark (especially if it was working last week). It looks like the culprit is a missing JAR for the AWS auth functionality.

Thanks @tpoterba for the quick reply. I run the code below before the function call above.

from pyspark.sql import SparkSession
import hail as hl

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

I got the message below:

pip-installed Hail requires additional configuration options in Spark referring
to the path to the Hail Python module directory HAIL_DIR,
e.g. /path/to/python/site-packages/hail:
  spark.jars=HAIL_DIR/hail-all-spark.jar
  spark.driver.extraClassPath=HAIL_DIR/hail-all-spark.jar
  spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 2.4.4
SparkUI available at http://ip-10-60-152-12.eu-west-2.compute.internal:8081
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.78-b17627756568
LOGGING: writing to /opt/notebooks/hail-20230518-1658-0.2.78-b17627756568.log

Is there anything I can do on my end to resolve the error?

No, you need to contact DNAnexus and ask them to fix this issue. They must’ve changed the Hail installation.
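
For reference, on a cluster you administer yourself, the options named in that startup warning would be passed when building the Spark session, roughly like this (a sketch following the warning text; it will not by itself restore the missing AWS JAR on DNAnexus):

import os
import hail as hl
from pyspark.sql import SparkSession

# HAIL_DIR per the warning: the directory of the pip-installed hail package.
hail_dir = os.path.dirname(hl.__file__)
builder = (
    SparkSession.builder
    .config('spark.jars', f'{hail_dir}/hail-all-spark.jar')
    .config('spark.driver.extraClassPath', f'{hail_dir}/hail-all-spark.jar')
    .config('spark.executor.extraClassPath', './hail-all-spark.jar')
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)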

@danking Thanks for your quick reply. I will contact DNAnexus to fix this issue.