GWAS analysis on whole exome data without controls

Hello everybody. My name is Vincent and I have whole exome sequencing data for humans (only one phenotype).
I'd like to know which SNPs and loci are under selection and whether they are unique to our population of study. There was no control group for the study, so I have been advised to compare with exome data from the 1000 Genomes Project. I would be happy to be guided on what to do. At the moment I have been able to call the SNPs using bcftools, and I have started practicing some tutorials using Hail with some help from @danking.
Thank you in advance.

Paging @kumarveerapen - he might be able to help here. I’m guessing that using gnomAD is your best bet, but I’m not sure exactly how to design the study to avoid confounders.

Thanks for tagging me on this, Tim! :slight_smile:

I would first like to clarify your question, Vincent. You would like to test for SNPs that have undergone selection and are not reported in your reference population? Would you mind my asking: what is the ancestry distribution of your population?

I would also follow what Tim has suggested: annotate your variants with our Hail annotation_db for gnomAD allele frequencies in order to see what the variant frequencies are. Once annotated, you could then see which variants are relatively rare in your tested population and which are not.
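For concreteness, a minimal sketch of that annotation step (the dataset name 'gnomad_exome_sites' and the MatrixTable path are placeholders, and the annotation DB reads from Google Cloud Storage, which matters later in this thread):

import hail as hl

# Your exome calls, already converted to a Hail MatrixTable (placeholder path).
mt = hl.read_matrix_table('my_ghana_exomes.mt')

# Annotate rows from the Hail annotation database.
# 'gnomad_exome_sites' is an assumed dataset name; check the annotation
# database docs for the exact names, and note that newer Hail versions may
# require DB(region=..., cloud=...) arguments.
db = hl.experimental.DB()
mt = db.annotate_rows_db(mt, 'gnomad_exome_sites')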

Thanks @tpoterba @kumarveerapen. The population is of African ancestry, specifically Ghana.

Hi @tpoterba @danking. I attempted annotating my data.
First I had to convert it to a MatrixTable (mt), which was successful thanks to help from @danking.
But when I try to annotate with exome_sites I get this error:

Traceback (most recent call last):
File "", line 1, in
File "", line 2, in annotate_rows_db
File "/home/vappiah/.local/lib/python3.6/site-packages/hail/typecheck/check.py", line 611, in wrapper
return original_func(*args, **kwargs)
File "/home/vappiah/.local/lib/python3.6/site-packages/hail/experimental/db.py", line 209, in annotate_rows_db
indexed_value = dataset.index_compatible_version(rel.key)
File "/home/vappiah/.local/lib/python3.6/site-packages/hail/experimental/db.py", line 64, in index_compatible_version
for version in self.versions)
File "/home/vappiah/.local/lib/python3.6/site-packages/hail/experimental/db.py", line 62, in
index
File "/home/vappiah/.local/lib/python3.6/site-packages/hail/experimental/db.py", line 64, in
for version in self.versions)
File "/home/vappiah/.local/lib/python3.6/site-packages/hail/experimental/db.py", line 29, in maybe_index
return hl.read_table(self.url)._maybe_flexindex_table_by_expr(
File "", line 2, in read_table
File "/home/vappiah/.local/lib/python3.6/site-packages/hail/typecheck/check.py", line 611, in wrapper
return original_func(*args, **kwargs)
File "/home/vappiah/.local/lib/python3.6/site-packages/hail/methods/impex.py", line 2359, in read_table
for rg_config in Env.backend().load_references_from_dataset(path):
File "/home/vappiah/.local/lib/python3.6/site-packages/hail/backend/spark_backend.py", line 327, in load_references_from_dataset
return json.loads(Env.hail().variant.ReferenceGenome.fromHailDataset(self.fs._jfs, path))
File "/home/vappiah/.local/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in call
answer, self.gateway_client, self.target_id, self.name)
File "/home/vappiah/.local/lib/python3.6/site-packages/hail/backend/spark_backend.py", line 40, in deco
'Error summary: %s' % (deepest, full, hail.version, deepest)) from None
hail.utils.java.FatalError: IOException: No FileSystem for scheme: gs

Java stack trace:
java.io.IOException: No FileSystem for scheme: gs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at is.hail.io.fs.HadoopFS.fileStatus(HadoopFS.scala:148)
at is.hail.io.fs.FS$class.isDir(FS.scala:106)
at is.hail.io.fs.HadoopFS.isDir(HadoopFS.scala:57)
at is.hail.expr.ir.RelationalSpec$.readMetadata(AbstractMatrixTableSpec.scala:31)
at is.hail.expr.ir.RelationalSpec$.readReferences(AbstractMatrixTableSpec.scala:66)
at is.hail.variant.ReferenceGenome$.fromHailDataset(ReferenceGenome.scala:587)
at is.hail.variant.ReferenceGenome.fromHailDataset(ReferenceGenome.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

Hail version: 0.2.41-b8144dba46e6
Error summary: IOException: No FileSystem for scheme: gs

The annotation DB only works for Hail running inside Google Cloud. The Hail team maintains tools to make that easy.

If you cannot use the cloud, you can download the gnomAD sites tables to your computer from the gnomAD website directly: https://gnomad.broadinstitute.org/downloads.

Then you can do something like:

# Read the downloaded gnomAD exomes sites Hail Table and join it onto your
# MatrixTable rows by locus and alleles.
gnomad = hl.read_table('/path/to/gnomad.exomes.r2.1.1.sites.ht')
mt = mt.annotate_rows(gnomad = gnomad[mt.locus, mt.alleles])
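From there, a possible follow-up (a sketch, assuming the gnomAD 2.1.1 sites schema, where freq is an array of structs and freq[0] holds the overall allele frequency) is to flag variants that are rare or absent in gnomAD:

# Variants missing from the gnomAD table get an AF of 0.0 here, i.e. they
# are treated as unseen in gnomAD.
mt = mt.annotate_rows(
    gnomad_af = hl.or_else(mt.gnomad.freq[0].AF, 0.0),
    in_gnomad = hl.is_defined(mt.gnomad))

# Keep variants that are rare in gnomAD (threshold is arbitrary).
rare_mt = mt.filter_rows(mt.gnomad_af < 0.01)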

Thanks @danking @tpoterba.
I was able to download the loci and do the annotation.

@kumarveerapen I was able to do the annotation with my data. Thanks

@tpoterba @kumarveerapen. The data I have is of Ghanaian ancestry with just 1 phenotype. Because I don’t have controls, I would like to use data from gnomAD (African subset) as my negative control and then perform the rest of the analysis.
I have been able to download the exome data and performed the annotation as you suggested, but I have no idea how to extract only the African population from the data. If you have any other approach I am willing to try it as well. Thanks for the help.
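In case it helps, here is one way the population-specific frequency is often pulled out of the gnomAD 2.1.1 sites Hail Table (a sketch; it assumes the freq array plus a global freq_index_dict mapping labels such as 'gnomad_afr' to positions in that array, so print the dict to confirm the key for your release):

gnomad = hl.read_table('/path/to/gnomad.exomes.r2.1.1.sites.ht')

# freq_index_dict is a global field mapping population labels to indices
# into the freq array; 'gnomad_afr' is the assumed label for the
# African/African-American subset.
freq_index_dict = hl.eval(gnomad.freq_index_dict)
afr_idx = freq_index_dict['gnomad_afr']

mt = mt.annotate_rows(
    gnomad_afr_af = gnomad[mt.locus, mt.alleles].freq[afr_idx].AF)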

Thanks for tagging me, Jotes! I would be very wary about using gnomAD as a control dataset because it was curated for the purpose of describing variation, not to be used as a statistical control, especially since individual phenotypes in gnomAD are not known. Moreover, gnomAD individual-level data is not publicly available.

Are there any other control data that you are able to get your hands on? Say, something on dbGaP? That way, you could select a phenotype control closest to the phenotype that you are testing and have improved confidence in your variant identification.

I was planning to use data from the 1000 Genomes Project, but I found out the VCFs they provide are for whole genome sequencing. So I will download the raw exome sequence data and call the SNPs myself. What do you think of this plan, @tpoterba @kumarveerapen @danking?

That’s a better idea, but I would still implore you to try to find a phenotype-specific control to improve the interpretability of your variants.

Thanks @kumarveerapen. I will do that.