Hello,
I’m working on outputting genotype (GT) calls for each sample, but I’m running into issues handling the MatrixTable. To avoid conflicts, I’ve ended up using .collect() to bring the entire table into memory, which isn’t ideal because of the high memory consumption.
My current script:
sample = col.s
# Filter the MatrixTable for the specific sample
sample_mt = self.mt.filter_cols(self.mt.s == sample)
# Extract the GT field; handled as a Table this way to avoid a structure conflict
sample_entries_table = sample_mt.entries()
sample_entries_table = sample_entries_table.select('GT')
# Convert the entries (GT field) to a simple array without keys
gt_array = sample_mt.entries().select('GT').collect()
Previously I used export():
# This outputs extra columns (locus, alleles, and s)
sample_entries_table = sample_mt.entries().select('GT')
sample_entries_table.export(output_file)
I’m not sure whether this is a key column issue or a problem with how I’m calling these functions, but I feel there must be a more efficient way to extract the individual GT calls without using collect(). Any guidance would be appreciated.
Hi @cchunju8286,
Would you mind sharing what you want to do with the result? Maybe we can give you a better answer with more details.
You can omit the extra fields and globals in Table.export by dropping the key:
>>> mt = sample_mt.entries().key_by().select_globals().select('GT')
>>> mt.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'GT': call
----------------------------------------
Key: []
----------------------------------------
Hope this helps,
Hi @ehigham,
Sorry for the late response; I was focusing on other projects. The method you suggested worked and solved the problem. The output I wanted is simply a single GT column for each individual, like below:
GT
0/1
0/0
0/0
If you don’t mind, I’d like to ask another question. I tried to parallelize the task, and it fails with “joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.”
My understanding is that this Hail MatrixTable cannot be passed to multiple tasks at the same time. Is that right? I am also trying to avoid reading the matrix multiple times so that I don’t overload memory during processing. Have you ever encountered this problem, or do you have an idea of what causes it?
My code in the main script:
def process_sample(sample):
    # Filter MatrixTable for a specific sample
    sample_mt = mt.filter_cols(mt.s == sample)
    output_sample_gt(sample_mt, sample, chr, output_dir)

# Parallelize the processing
Parallel(n_jobs=n_jobs)(
    delayed(process_sample)(sample) for sample in sample_ids
)
Let me know if the problem is clear.
Thanks again
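A note on the BrokenProcessPool error above: joblib's process-based backend has to serialize everything a task needs, including the global mt that process_sample refers to, and rebuild it in each worker process. A likely cause (an assumption, not confirmed in this thread) is that the MatrixTable is tied to the live Hail session in the main process and so cannot be recreated in a worker. The sketch below uses only the standard library to illustrate the general rule. SessionBoundTable is a hypothetical stand-in, not part of Hail's API, and the path and sample name are made up.

```python
import pickle
import threading


class SessionBoundTable:
    """Hypothetical stand-in for an object tied to a live, process-local session."""

    def __init__(self, path):
        self.path = path
        # A lock stands in for a resource that only exists in this process.
        self._handle = threading.Lock()


def is_picklable(obj):
    """Return True if obj survives a pickle round trip, False otherwise."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False


table = SessionBoundTable("dataset.mt")  # hypothetical path
# The object holding a live handle cannot be shipped to another process.
handle_ok = is_picklable(table)
# Plain identifiers (a path and a sample ID) can be shipped without trouble.
task_ok = is_picklable(("dataset.mt", "sample_1"))
print(handle_ok, task_ok)
```

If this is indeed the cause, one option is to pass only picklable values such as sample IDs to the workers. Another option is to skip joblib and loop over sample_ids in the main process, since Hail already distributes the work of each query across its backend, so a sequential loop may be sufficient.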