Another try to extract samples - new approach

eilalan · May 16, 2017, 9:43pm

Hi,
I took a day off from Hail :=) and today resumed my efforts to extract samples with the code below. The idea is simple: get the list of sample IDs, read each sample's annotations, and write them to a file.
However, both calls come back empty:
samples_IDs = vds.sample_ids => returns an empty list
print(num_of_samples) prints 0

Please let me know if you have any idea how I can get the list of sample IDs and extract all the variants and other annotations related to each sample.
Many thanks,
eilalan

from hail import *
print("hc")
hc = HailContext()
print("vds")
#vds = hc.read('gs://data_gnomad_orielresearch/gnomad.exomes.r2.0.1.sites.Y.vds')
vds = hc.read('gs://data_gnomad_orielresearch/gnomad.exomes.r2.0.1.sites.vds')

# read the list of samples IDs - list of str
print("samples_IDs")
samples_IDs = vds.sample_ids

# read number of samples
num_of_samples = vds.num_samples
print("num_of_samples")
print(num_of_samples)

tpoterba · May 16, 2017, 9:59pm

Individual-level information, such as sample IDs and genotypes, is not in the gnomAD public release. The VDSes you're reading are sites-only: they include just the variants and their annotations, but they do carry summary statistics computed from the genotypes, such as population-specific allele frequencies and counts of each genotype class.
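
A quick way to recognize this situation before trying to extract samples is to check the two attributes used in the question, `num_samples` and `sample_ids`. The sketch below is only an illustration: `is_sites_only` and `FakeVDS` are hypothetical names, and `FakeVDS` is a stand-in object so the snippet runs without Hail or Spark installed.

```python
from collections import namedtuple


def is_sites_only(vds):
    """Return True when the dataset carries no sample columns.

    Relies only on `num_samples` and `sample_ids`, the two attributes
    used in the question above.
    """
    return vds.num_samples == 0 and len(vds.sample_ids) == 0


# Hypothetical stand-in for a VDS, so the sketch runs without Hail or Spark.
FakeVDS = namedtuple("FakeVDS", ["num_samples", "sample_ids"])

sites_only = FakeVDS(num_samples=0, sample_ids=[])
with_samples = FakeVDS(num_samples=2, sample_ids=["sample_a", "sample_b"])

print(is_sites_only(sites_only))    # dataset with the samples dropped
print(is_sites_only(with_samples))  # dataset that still has samples
```

In a real session the same function could be passed the object returned by `hc.read(...)`; for the gnomAD sites file above it would be expected to report a sites-only dataset.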

eilalan · May 16, 2017, 10:42pm

Thank you. Just to make sure that I understand you: in this figure, https://hail.is/hail/overview.html#overview-vds, the columns are NOT samples (s), each with all the gene variants it harbors (va).
Please confirm.

Many thanks,
eilalan

eilalan · May 16, 2017, 10:48pm

If needed and allowed, I am happy to help with generating this kind of data. I work at the Broad and would be more than happy to help.
I think that a multidimensional view of samples is very important for new discoveries.

Best,
eilalan

tpoterba · May 16, 2017, 10:57pm

That’s the view of a typical VDS. gnomAD is a special case where all the samples have been dropped (for privacy reasons). The gnomAD team used Hail on the full genetic matrix (~40TB .vcf.bgz) to generate this releasable sites file.

Take a look at some of the public 1000 Genomes data we host:

gs://hail-1kg/ALL.1KG.try3.vds/

Because the 1000 Genomes Project made all of its individual-level data public, this dataset can have samples in it.
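
For a dataset that does keep its samples, the first step described in the original question (get the sample IDs and write them to a file) might look roughly like the sketch below. It uses only the `sample_ids` attribute from the question; `write_sample_ids`, `FakeVDS`, and the output filename are hypothetical, and `FakeVDS` again stands in for a real VDS so the snippet runs without Hail.

```python
import os
import tempfile
from collections import namedtuple


def write_sample_ids(vds, path):
    """Write one sample ID per line and return how many were written."""
    ids = list(vds.sample_ids)
    with open(path, "w") as out:
        for sample_id in ids:
            out.write(sample_id + "\n")
    return len(ids)


# Hypothetical stand-in for a VDS that still contains samples.
FakeVDS = namedtuple("FakeVDS", ["num_samples", "sample_ids"])
demo = FakeVDS(num_samples=2, sample_ids=["sample_a", "sample_b"])

# Write to a temporary directory so the example leaves no files behind
# in the working directory.
out_path = os.path.join(tempfile.mkdtemp(), "sample_ids.txt")
written = write_sample_ids(demo, out_path)
print(written)
```

With Hail available, the argument would instead be the dataset read from the `gs://hail-1kg/ALL.1KG.try3.vds/` path above.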

eilalan · May 16, 2017, 11:03pm

Thank you. This is very helpful.
