Hi @danking,
Sorry for the late reply. We are indeed converting BGEN files to MatrixTables, and clearly this is the issue. How would one store BGEN files in the cloud without needing to localize the data for analysis? For example, we want to extract a subset of SNPs from the UK Biobank for use in a PRS, and we don't want to localize the BGEN files from the bucket every time we extract SNPs. Given this, we tried converting the BGEN files to MatrixTables, storing the MatrixTables in the cloud, and querying them via Hail. Here is the code that builds the MatrixTables:
```python
#!/usr/bin/python3
# sys.argv[1] = chromosome
# sys.argv[2] = chromosome-specific BGEN file
# sys.argv[3] = path to sample file
import sys

import hail as hl

chrom = sys.argv[1]
bgen_file = sys.argv[2]
sample_file = sys.argv[3]
mt_file = "%s.mt" % chrom

# Index the BGEN file (writes an index alongside it), recoding the
# zero-padded contig names ("01", ..., "09") to GRCh37 names.
hl.index_bgen(bgen_file,
              contig_recoding={"01": "1", "02": "2", "03": "3",
                               "04": "4", "05": "5", "06": "6",
                               "07": "7", "08": "8", "09": "9"},
              reference_genome='GRCh37')

# Import dosages and write the result out as a MatrixTable.
mt = hl.import_bgen(bgen_file,
                    entry_fields=['dosage'],
                    sample_file=sample_file)
mt.write(mt_file)
```
We then copy the MatrixTables to a bucket.
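As an aside, the separate copy step may be avoidable: Hail can read and write `gs://` paths directly when the cluster's service account has access to the bucket. A minimal sketch of building the target path (the bucket name below is a placeholder, not our real path):

```python
# Sketch: write the MatrixTable straight to the bucket rather than
# writing locally and copying afterwards. "gs://my-bucket" is a
# placeholder bucket name.
chrom = "22"
mt_path = "gs://my-bucket/mts/%s.mt" % chrom
print(mt_path)  # gs://my-bucket/mts/22.mt

# In the build script above, this would replace the local write:
# mt.write(mt_path)
```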
We then query the MatrixTables like so:
```python
import timeit

import hail as hl
import pandas as pd


### function to extract SNPs
def extract_snps(prs_data, chrom):
    """Filter one chromosome's MatrixTable down to the PRS SNPs."""
    mt_file = "gs://..." + str(chrom) + ".mt"
    mt = hl.read_matrix_table(mt_file)
    # get rsids for this chromosome only
    prs_data_chrom = prs_data.loc[prs_data['chr_name'] == chrom]
    rsids = prs_data_chrom.rsID.values
    set_to_keep = hl.literal(set(rsids))
    mt_filt = mt.filter_rows(set_to_keep.contains(mt['rsid']))
    print(mt_filt.count())
    return mt_filt


### function to create the PRS
def create_prs(mt, prs_data, chrom):
    """Annotate effect alleles and weights, then export per-sample scores."""
    prs_data_chrom = prs_data.loc[prs_data['chr_name'] == chrom]
    # bring the effect allele and weight into Hail, keyed by rsid
    prs_table = (prs_data_chrom[['effect_allele', 'effect_weight', 'rsID']]
                 .rename(columns={'rsID': 'rsid'}))
    prs_ht = hl.Table.from_pandas(prs_table, key='rsid')
    mt = mt.annotate_rows(**prs_ht[mt.rsid])
    # flip is True when the effect allele is the reference allele,
    # in which case the effect-allele dosage is 2 - dosage
    flip = (hl.case()
            .when(mt.effect_allele == mt.alleles[0], True)
            .when(mt.effect_allele == mt.alleles[1], False)
            .or_missing())
    mt = mt.annotate_rows(flip=flip)
    # drop variants whose effect allele matches neither allele
    mt = mt.filter_rows(hl.is_defined(mt.flip))
    mt = mt.annotate_cols(
        prs=hl.agg.sum(mt.effect_weight *
                       hl.if_else(mt.flip, 2 - mt.dosage, mt.dosage)))
    mt.cols().export("%s_mt_prs.txt.bgz" % chrom)


### running extraction
hl.init(min_block_size=128)
prs_data = pd.read_csv('prs.txt', sep="\t", header=0)

chrom = 3

start = timeit.default_timer()
mt = extract_snps(prs_data, chrom)
mt.entries().export("%s_genotype_dosage.tsv.bgz" % chrom)
mt.write("%s_biomarin_prs_snps.mt" % chrom)
print(timeit.default_timer() - start)

start = timeit.default_timer()
create_prs(mt, prs_data, chrom)
print(timeit.default_timer() - start)
```
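For what it's worth, here is the flip arithmetic from `create_prs` checked in pure Python with toy weights and dosages (no Hail needed): when `flip` is true, the effect allele is the reference allele, so its expected count is `2 - dosage`.

```python
# Toy check of the PRS arithmetic: per variant, use the dosage directly
# when the effect allele is the alternate allele, and 2 - dosage when it
# is the reference allele (flip == True). The score is the weighted sum.
variants = [
    # (effect_weight, alt-allele dosage, flip?)
    (0.5, 1.2, False),  # effect allele is alt: count = 1.2
    (0.3, 0.4, True),   # effect allele is ref: count = 2 - 0.4 = 1.6
]
prs = sum(w * ((2 - d) if flip else d) for w, d, flip in variants)
print(round(prs, 3))  # 0.5 * 1.2 + 0.3 * 1.6 = 1.08
```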
The above code can clearly be improved, and based on your feedback it seems we have yet to fully grok how to use Hail effectively.
So, how do you recommend we query the UK Biobank BGEN files so that we avoid localizing entire chromosomes to a VM for each extraction?
Thanks in advance for your help. I am keen to learn more about Hail, so if you could also point me to any additional documentation that may be helpful, that would be great.
Vince