I have a MatrixTable. I plan to filter it, take a random sample of, say, 100 rows from the filtered result, and collect the genes (i.e. mt.gene_symbol) into a Python list. I plan to do this for 200 permutations. Can I please get some advice on how to implement the random sampling? Thank you.
Hi @haileyjan !
Do I understand correctly that you want to draw 200 sets of 100 variants each, convert each variant to a gene name, and end up with 200 lists of gene names? Will you use the genotypes? What will you do with these lists of gene names?
I’m not sure Hail is the best tool for this problem if you just want a list of lists of gene names.
If you don’t have a lot of variants (say, fewer than 10M), I’d just collect the variants and do this in Python:
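import random
# Collect the row keys plus gene_symbol to the driver as a Python list.
variants = mt.rows().select('gene_symbol').collect()
# Draw 200 independent samples of 100 rows each (without replacement within a sample).
samples = [
    random.sample(variants, k=100)
    for _ in range(200)
]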
If you have, say, 1B variants, collecting them all might run out of memory. Instead, you could try generating 200 sets of indexes into the matrix table rows. You’d need to use add_row_index to get row indexes. Then you could filter to the rows that are in at least one group and collect only those.
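Here is a rough sketch of that index-based approach, not a tested recipe. The helper names draw_index_groups and genes_for_groups are made up for illustration, and it assumes gene_symbol is a row field and that mt has already been filtered (add the row index after filtering, so the indexes run from 0 to mt.count_rows() - 1). row_idx is the default field name that add_row_index creates.

```python
import random


def draw_index_groups(n_rows, n_groups=200, group_size=100, seed=None):
    """Hypothetical helper: draw n_groups lists of distinct row indexes."""
    rng = random.Random(seed)
    return [rng.sample(range(n_rows), k=group_size) for _ in range(n_groups)]


def genes_for_groups(mt, groups):
    """Hypothetical helper: map each index group to a list of gene symbols.

    Assumes `mt` is an already-filtered MatrixTable with a row field
    named `gene_symbol`.
    """
    import hail as hl  # imported here so the sampler above works without Hail

    mt = mt.add_row_index()  # adds an int64 row field called `row_idx`
    # Union of all sampled indexes; typed as int64 to match `row_idx`.
    wanted = hl.literal({i for g in groups for i in g}, 'set<int64>')
    # Keep only rows that appear in at least one group, then collect them.
    rows = (mt.filter_rows(wanted.contains(mt.row_idx))
              .rows()
              .select('row_idx', 'gene_symbol')
              .collect())
    gene_by_idx = {r.row_idx: r.gene_symbol for r in rows}
    return [[gene_by_idx[i] for i in g] for g in groups]
```

Usage would look something like groups = draw_index_groups(mt.count_rows()) followed by gene_lists = genes_for_groups(mt, groups). A single iteration is the same thing with n_groups=1. Only the sampled rows (at most 200 × 100) are collected, so memory on the driver should not be a concern.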
Yes, that is right! I have quite a lot of variants. Can you run through how a single iteration would go?
Let’s say I start with a mt, and I simply want to select 100 random rows into another mt, then collect the gene_symbol into a python list.