Gwas analyis on whole exome data without controls

Hello everybody. My name is Vincent and I have a whole exome sequence data for humans(only one phenotype).
I’ll like to know SNPs and loci under selection and if they are unique to our population of study.There was not control for the study so I have been advised to compare with exome data from 1000 genomes project. I will be happy if I can be guided on what do. At the moment I have been able to call the snps using bcftools and started practicing some tutorial using hail and got some help from @danking.
Thank you in advance.

Paging @kumarveerapen - he might be able to help here. I’m guessing that using gnomAD is your best bet here, but I’m not sure exactly how to design the study to avoid confounders.

Thanks for tagging me on this, Tim! :slight_smile:

I would first like to clarify about your question, Vincent. You would like to test for SNPs that have undergone selection and are not reported in your reference population? Would you mind us asking: What is the distribution of your population ancestry?

I would also follow what Tim has suggested by annotating your variants with our Hail annotation_db for gnomAD allele frequencies in order to see what the variant frequencies are. Upon annotation, you could then which variants are relatively rare in your tested population and which are not.

Thanks @tpoterba @kumarveerapen. The population is African Ancestry. Specifically Ghana

Hi @tpoterba @danking. I attempted annotating my data.
First I had to convert to an mt object which was successful thanks to help from @danking.
But when I try to annotate with exome_sites I get the error

Traceback (most recent call last):
File “”, line 1, in
File “”, line 2, in annotate_rows_db
File “/home/vappiah/.local/lib/python3.6/site-packages/hail/typecheck/”, line 611, in wrapper
return original_func(*args, **kwargs)
File “/home/vappiah/.local/lib/python3.6/site-packages/hail/experimental/”, line 209, in annotate_rows_db
indexed_value = dataset.index_compatible_version(rel.key)
File “/home/vappiah/.local/lib/python3.6/site-packages/hail/experimental/”, line 64, in index_compatible_version
for version in self.versions)
File “/home/vappiah/.local/lib/python3.6/site-packages/hail/experimental/”, line 62, in
File “/home/vappiah/.local/lib/python3.6/site-packages/hail/experimental/”, line 64, in
for version in self.versions)
File “/home/vappiah/.local/lib/python3.6/site-packages/hail/experimental/”, line 29, in maybe_index
return hl.read_table(self.url)._maybe_flexindex_table_by_expr(
File “”, line 2, in read_table
File “/home/vappiah/.local/lib/python3.6/site-packages/hail/typecheck/”, line 611, in wrapper
return original_func(*args, **kwargs)
File “/home/vappiah/.local/lib/python3.6/site-packages/hail/methods/”, line 2359, in read_table
for rg_config in Env.backend().load_references_from_dataset(path):
File “/home/vappiah/.local/lib/python3.6/site-packages/hail/backend/”, line 327, in load_references_from_dataset
return json.loads(Env.hail().variant.ReferenceGenome.fromHailDataset(self.fs._jfs, path))
File “/home/vappiah/.local/lib/python3.6/site-packages/py4j/”, line 1257, in call
answer, self.gateway_client, self.target_id,
File “/home/vappiah/.local/lib/python3.6/site-packages/hail/backend/”, line 40, in deco
‘Error summary: %s’ % (deepest, full, hail.version, deepest)) from None IOException: No FileSystem for scheme: gs

Java stack trace: No FileSystem for scheme: gs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(
at org.apache.hadoop.fs.FileSystem.createFileSystem(
at org.apache.hadoop.fs.FileSystem.access$200(
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(
at org.apache.hadoop.fs.FileSystem$Cache.get(
at org.apache.hadoop.fs.FileSystem.get(
at org.apache.hadoop.fs.Path.getFileSystem(
at is.hail.variant.ReferenceGenome$.fromHailDataset(ReferenceGenome.scala:587)
at is.hail.variant.ReferenceGenome.fromHailDataset(ReferenceGenome.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at py4j.reflection.MethodInvoker.invoke(
at py4j.reflection.ReflectionEngine.invoke(
at py4j.Gateway.invoke(
at py4j.commands.AbstractCommand.invokeMethod(
at py4j.commands.CallCommand.execute(

Hail version: 0.2.41-b8144dba46e6
Error summary: IOException: No FileSystem for scheme: gs

The annotation DB only works for Hail inside Google Cloud. Hail team maintains tools to make that easy.

If you cannot use the cloud, you can download the gnomad loci to your computer from the gnomad website directly:

Then you can do something like:

gnomad = hl.read_table('/path/to/')
mt = mt.annotate_rows(gnomad = gnomad[, mt.alleles])
1 Like

Thanks @danking @tpoterba .
I was able to download the download the loci and do the annotation.

@kumarveerapen I was able to do the annotation with my data. Thanks

@tpoterba @kumarveerapen . The data I have is Ghanaian Ancestry with just 1 phenotype. Because I don’t have controls, I would like to use data from gnomeAD (african subset) as my negative control and then perform the rest of the analysis.
I have been able to download the exome data and performed the annotation as you suggested. But i have no idea how to extract only the african population from the data. If you have any other approach i am willing to try as well. Thanks for the help.

Thanks for tagging me, Jotes! I would be very weary about using gnomAD as a control dataset because it was curated for the purpose of describing variation and not to be used as a statistical control, especially since the individual phenotype in gnomAD is not known. Moreover, gnomAD individual level data is not publicly available.

Are there any other control data that you are able to get your hands on? Say something on dbGAP? That way, you could select a phenotype control closest to the phenotype that you are testing and have improved confidence in your variation identifcation.

I was planning to use data from 1000 genomes project. But I found out the vcf they have is for whole genome sequence. So I will download the raw exome sequence and call the snps. What do you think of this plan @tpoterba @kumarveerapen @danking

That’s a relatively better idea. But I would still implore that you attempt to find a phenotype specific control to improve interpretability of your variants.

Thanks @kumarveerapen. I will do that.