Sure, below is the relevant code and a bit of the downstream steps I have. I have successfully run this workflow for 3 genes before, but I'm struggling with larger numbers.
#intervals is of length 3472 to be exact. Each interval corresponds to a particular gene of interest.
intervals = ['chr7:113116717-113118553', 'chr7:113876776-113919008', 'chr16:53703962-54121940', ...]
mt = hl.filter_intervals(mt, [hl.parse_locus_interval(x) for x in intervals[:1000]], keep=True)
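One thing worth double-checking at this step: hl.parse_locus_interval defaults to GRCh37, while the chr-prefixed contigs in your list look like GRCh38. If your current call already runs without contig errors you're fine, but a minimal sketch with the reference genome made explicit (keeping your [:1000] slice; drop it to use all 3,472 intervals):

# Assumption: the data is GRCh38; parse_locus_interval defaults to GRCh37,
# so passing reference_genome explicitly avoids a contig-name mismatch.
parsed = [hl.parse_locus_interval(x, reference_genome='GRCh38') for x in intervals[:1000]]
mt = hl.filter_intervals(mt, parsed, keep=True)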
→ I plan to filter the columns based on a list of samples I already have
mt = mt.semi_join_cols(samples)
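For context, semi_join_cols expects a Hail Table keyed the same way as the MatrixTable's column key (the sample ID field s). A sketch of building that table, assuming a hypothetical one-column TSV of sample IDs with an s header:

# Hypothetical path and header; key by 's' so it matches the column key
samples = hl.import_table(f'{bucket}/data/samples.tsv', key='s')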
→ Next I encode each genotype as its alternate-allele count
mt_encoded = mt.annotate_entries(n_alt_alleles = mt.GT.n_alt_alleles())
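One subtlety worth deciding explicitly here: when GT is missing for an entry, n_alt_alleles will be missing too, and those values come out as NA in the export. If you'd rather encode missing genotypes as 0, a sketch (this assumes missingness means homozygous reference, which may or may not be right for your data):

# Assumption: treat missing GT as 0 alt alleles; keep the default if NA is preferable
mt_encoded = mt.annotate_entries(n_alt_alleles=hl.or_else(mt.GT.n_alt_alleles(), 0))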
→ Next I flatten the MatrixTable into a long-format entries table, keeping one row per sample-variant pair with its number of alternate alleles
entries_table = mt_encoded.entries()
entries_table = entries_table.key_by()
→ Annotate a unique variant identifier (combining locus and alleles)
entries_table = entries_table.annotate(variant=hl.str(entries_table.locus) + "_" + hl.delimit(entries_table.alleles, ","))
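If you'd prefer Hail's canonical colon-delimited variant string (contig:pos:ref:alt) over the underscore form, there is a built-in that should be equivalent for this purpose; a sketch:

# Alternative: built-in colon-delimited variant identifier
entries_table = entries_table.annotate(variant=hl.variant_str(entries_table.locus, entries_table.alleles))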
→ Select the relevant fields (sample ID, variant, and n_alt_alleles)
entries_table = entries_table.select(entries_table.s, entries_table.variant, entries_table.n_alt_alleles)
→ Exporting to the workspace bucket
entries_table.export(f'{bucket}/data/n_alt_alleles.csv')
Note: this last step of exporting to my bucket is also taking a very long time (I've been waiting for 1.5 hours now); wondering if you have any thoughts on that.
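On the slow export: Hail is lazy, so the interval filter, semi-join, and entry annotation all actually run only when export is called, and writing a single output file requires concatenating every partition at the end. Also note that the entries table has one row per sample-variant pair, so across thousands of gene intervals the output can be enormous, which alone may explain the runtime. Two things that often help, sketched below with hypothetical paths (and note export writes tab-delimited by default regardless of the .csv extension, so pass delimiter=',' for a real CSV):

# Materialize the pipeline once; later actions read from this checkpoint
entries_table = entries_table.checkpoint(f'{bucket}/data/n_alt_alleles.ht', overwrite=True)
# Parallel export writes a directory of shards (each with its own header)
# instead of concatenating everything into one file
entries_table.export(f'{bucket}/data/n_alt_alleles.csv', delimiter=',', parallel='header_per_shard')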