Hello Hail team!
I am attempting to read in a matrix table, perform a densify operation, filter to a subset of the data, and then mt.write(). The filter operations complete quickly, as I'd expect, but the mt.write() operation has been running for over an hour and is still nowhere near done.
The original mt is a huge amount of data: count() returns (2736925182, 7783). Before the write operation, however, we filter it down to just chr22 for ~500 of these individuals.
Is there any way to speed up the code?
import hail as hl
hl.init(default_reference='GRCh38')
import argparse
import numpy as np
import pandas as pd
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()
mt = hl.read_matrix_table('gs://filepath.sparse.mt/')
mt = hl.experimental.densify(mt)
# filter to chr22
intervals = ['chr22']
mt = hl.filter_intervals(mt, [hl.parse_locus_interval(x, reference_genome='GRCh38') for x in intervals])
# filter to test individuals
ppmi_ids = df[0]  # df is a pandas DataFrame of sample IDs loaded earlier (not shown)
samples_to_keep = set(ppmi_ids)
set_to_keep = hl.literal(samples_to_keep)
test_data = mt.filter_cols(set_to_keep.contains(mt.meta.external_id))
# save mt densified and filtered to chr22/PPMI
test_data.write('gs://dataproc-staging-us-east1-942231253036-bw4veo0a/test_start.mt/', overwrite=True)
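For reference, one reordering I considered (but have not verified is correct for sparse data) is applying the interval and column filters to the sparse mt first, so that densify only has to materialize chr22 for the ~500 kept samples rather than the whole dataset. This is just a sketch using the same paths and fields as above:

```python
import hail as hl

hl.init(default_reference='GRCh38')

mt = hl.read_matrix_table('gs://filepath.sparse.mt/')

# interval filter on the sparse, locus-keyed mt, before densifying
mt = hl.filter_intervals(
    mt, [hl.parse_locus_interval('chr22', reference_genome='GRCh38')])

# column filter, also before densifying
# (ppmi_ids comes from df[0] as above; df is loaded elsewhere)
set_to_keep = hl.literal(set(ppmi_ids))
mt = mt.filter_cols(set_to_keep.contains(mt.meta.external_id))

# densify only the small remaining slice, then write
mt = hl.experimental.densify(mt)
mt.write('gs://dataproc-staging-us-east1-942231253036-bw4veo0a/test_start.mt/',
         overwrite=True)
```

Would this ordering be safe here, or does densify need to see the full sparse mt to fill in reference blocks correctly?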