Another try to extract samples - new approach

Hi,
I took a day off from Hail :=) and today restarted my efforts to extract samples with the code below. The idea is simple: get the list of sample IDs, read each sample's annotations, and write them to a file.
However:
samples_IDs = vds.sample_ids returns an empty list
print(num_of_samples) prints 0

Please let me know if you have any idea how I can get the list of sample IDs and extract all the variants and other annotations that are related to each sample.
Many thanks,
eilalan


from hail import *
print("hc")
hc = HailContext()
print("vds")
#vds = hc.read('gs://data_gnomad_orielresearch/gnomad.exomes.r2.0.1.sites.Y.vds')
vds = hc.read('gs://data_gnomad_orielresearch/gnomad.exomes.r2.0.1.sites.vds')

# read the list of samples IDs - list of str

print("samples_IDs")
samples_IDs = vds.sample_ids

# read number of samples

num_of_samples = vds.num_samples
print("num_of_samples")
print(num_of_samples)

Individual-level information such as sample IDs and genotypes is not in the gnomAD public release. The VDSes you’re reading are sites-only. They include just the variants and their annotations, but they do have summary statistics computed from the genotypes, like population-specific allele frequencies and counts of each genotype class.
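A sites-only dataset can be detected before any per-sample extraction is attempted. The following is only a minimal sketch: it assumes the `num_samples` and `sample_ids` properties used in the question, and `is_sites_only`, `sample_ids_or_fail`, and `FakeVDS` are hypothetical names introduced here so that the example runs without Hail or bucket access.

```python
def is_sites_only(vds):
    """True when the dataset has no sample columns (e.g. the gnomAD public release)."""
    return vds.num_samples == 0


def sample_ids_or_fail(vds):
    """Return the sample IDs, or raise a clear error for a sites-only dataset."""
    if is_sites_only(vds):
        raise ValueError("sites-only VDS: no sample IDs or genotypes to extract; "
                         "only variant annotations are available")
    return list(vds.sample_ids)


# Hypothetical stand-in so the sketch runs without Hail or cloud access.
class FakeVDS(object):
    def __init__(self, sample_ids):
        self.sample_ids = sample_ids
        self.num_samples = len(sample_ids)


print(is_sites_only(FakeVDS([])))                  # a sites-only dataset
print(sample_ids_or_fail(FakeVDS(["S1", "S2"])))   # a dataset with samples
```

With Hail installed, the object returned by hc.read(...) would be passed in place of FakeVDS. For the gnomAD sites files the check should come back true, in line with the empty list and the 0 reported above.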

Thank you. Just to make sure that I understand you:
In this figure, https://hail.is/hail/overview.html#overview-vds, the columns are NOT samples (s), each with all of the gene variants that it harbors (va).
Please confirm.

Many thanks,
eilalan

If needed and allowed, I am happy to help with generating this kind of data. I work for the Broad and am more than happy to help.
I think that a multidimensional view of samples is very important for new discoveries.

Best,
eilalan

That’s the view of a typical VDS. gnomAD is a special case where all the samples have been dropped (for privacy reasons). The gnomAD team used Hail on the full genetic matrix (~40TB .vcf.bgz) to generate this releasable sites file.

Take a look at some of the public thousand genomes data we host:

gs://hail-1kg/ALL.1KG.try3.vds/

Since the thousand genomes project made public all of the individual-level data, this dataset can have samples in it.
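For a dataset that does keep its samples, the first step of the original plan (get the sample IDs and write them to a file) might look like the sketch below. It relies only on the `sample_ids` property from the question; `write_sample_ids` and `FakeVDS` are hypothetical names, and the two IDs are placeholders rather than values read from the bucket.

```python
import os
import tempfile


def write_sample_ids(vds, path):
    """Write one sample ID per line and return how many were written."""
    ids = list(vds.sample_ids)
    with open(path, "w") as out:
        for sample_id in ids:
            out.write(sample_id + "\n")
    return len(ids)


# Hypothetical stand-in for a VDS that still has its samples.
class FakeVDS(object):
    sample_ids = ["HG00096", "HG00097"]  # placeholder IDs for illustration


out_path = os.path.join(tempfile.mkdtemp(), "sample_ids.txt")
print(write_sample_ids(FakeVDS(), out_path))
```

With Hail installed and read access to the bucket, the call would presumably be write_sample_ids(hc.read('gs://hail-1kg/ALL.1KG.try3.vds/'), 'sample_ids.txt'), with hc created as in the original post.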

Thank you. This is very helpful.