Another try to extract samples - new approach


I took a day off from Hail :=) and today restarted my efforts to extract samples, using the code below. The idea is simple: get the list of sample IDs, read each sample's annotations, and write them to a file.
samples_IDs = vds.sample_ids => returns empty
print(num_of_samples) prints 0

Please let me know if you have any idea how I can get the list of sample IDs and extract all the variants and other annotations related to each sample.
Many thanks,

from hail import *
hc = HailContext()
# vds = hc.read('gs://data_gnomad_orielresearch/gnomad.exomes.r2.0.1.sites.Y.vds')
vds = hc.read('gs://data_gnomad_orielresearch/gnomad.exomes.r2.0.1.sites.vds')

# read the list of sample IDs (list of str)
samples_IDs = vds.sample_ids

# read the number of samples
num_of_samples = vds.num_samples


Individual-level information such as sample IDs and genotypes is not in the gnomAD public release. The VDSes you're reading are sites-only. They include just the variants and their annotations, but they do have summary statistics computed from the genotypes, like population-specific allele frequencies and counts of each genotype class.
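To make the sites-only idea concrete, here is a minimal toy sketch in plain Python (no Hail required). It is not how the gnomAD team produced the release; the sample IDs, variant keys, and field names (AC, AN, AF, n_het, and so on) are made up for illustration. It only shows how per-sample genotypes can be collapsed into per-variant summary statistics, after which no sample-level information remains.

```python
from collections import Counter

# Toy genotype matrix: rows are variants, columns are samples.
# Each call is the number of alternate alleles (0, 1, or 2) in a diploid genotype.
# All IDs and values below are hypothetical.
sample_ids = ["S1", "S2", "S3", "S4"]
genotypes = {
    "1:1000:A:T": [0, 1, 2, 0],
    "1:2000:G:C": [0, 0, 1, 0],
}

def to_sites_only(genotypes):
    """Collapse per-sample genotypes into per-variant summary statistics."""
    sites = {}
    for variant, calls in genotypes.items():
        counts = Counter(calls)
        ac = sum(calls)        # alternate allele count
        an = 2 * len(calls)    # total number of alleles (diploid samples)
        sites[variant] = {
            "AC": ac,
            "AN": an,
            "AF": ac / an,
            "n_hom_ref": counts[0],
            "n_het": counts[1],
            "n_hom_var": counts[2],
        }
    return sites

sites = to_sites_only(genotypes)
sites_sample_ids = []  # the sites-only result carries no sample IDs

for variant, stats in sites.items():
    print(variant, stats)
print("samples in sites-only data:", len(sites_sample_ids))
```

The summary keeps what can be said about each variant across the cohort, but there is no way to go back from it to the list of variants a given sample carries. That is why the sample list comes back empty and the sample count is 0 on the sites file.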


Thank you. Just to make sure that I understand you:
In this figure, the columns are NOT samples (s), each with all the gene variants it harbors (va).
Please confirm.

Many thanks,


If needed and allowed, I am more than happy to help with generating this kind of data. I work for the Broad.
I think that a multidimensional view of samples is very important for new discoveries.



That’s the view of a typical VDS. gnomAD is a special case where all the samples have been dropped (for privacy reasons). The gnomAD team used Hail on the full genetic matrix (~40TB .vcf.bgz) to generate this releasable sites file.

Take a look at some of the public Thousand Genomes data we host:


Since the Thousand Genomes Project made all of its individual-level data public, this dataset can have samples in it.


Thank you. This is very helpful.