Hello Hail team,
I am trying to understand the best way to work with my sparse matrix table.
My understanding is that before applying any filters, I should use hl.experimental.densify() to convert the sparse mt to a dense mt. When I do this, however, any count, write, or show operation goes from ~1,100 steps to ~69,000 steps. As a result, a single count() or write() takes hours to run.
What can I do to make this more efficient?
Thanks!
An example script might look like:
import hail as hl
import argparse
# Arguments
parser = argparse.ArgumentParser()
parser.add_argument("-f", "--full_run", action="store_true", help="Runs on chr22 and chrX only by default. If full_run is set, it runs on the whole matrix. WARNING: This will be VERY expensive")
parser.add_argument("-w", "--overwrite", action="store_true", help="If set will overwrite output matrix if it already exists")
requiredNamed = parser.add_argument_group("required named arguments")
requiredNamed.add_argument("-i", "--input_mt_path", required=True)
requiredNamed.add_argument("-o", "--output_mt_path", required=True)
#requiredNamed.add_argument("-p", "--requester_pays_project_id", help="Project ID to bill to when accessing requester pays bucket, needed to access hail annotationDB")
args = parser.parse_args()
# Store inputs
input_mt_path = args.input_mt_path
output_mt_path = args.output_mt_path
#requester_pays_project_id = args.requester_pays_project_id
# Read mt
mt = hl.read_matrix_table(input_mt_path)
mt = hl.experimental.densify(mt)
# Save mt densified and filtered to CHR22/PPMI
mt.write(output_mt_path, overwrite=args.overwrite)
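
For context, here is my mental model of what densifying does, as a toy pure-Python sketch for a single sample. This is not Hail code and not how Hail implements it: the row layout, the `from_block` flag, and the `densify_sample` helper are all made up for illustration. The only thing I am assuming about Hail is that densify fills missing entries from the reference block (with an END position) that covers them.

```python
# Toy model of one sample's column in a sparse table (hypothetical layout, NOT Hail's API).
# Each row is (position, entry). A reference block is an entry with an "END" position;
# positions where this sample has no stored entry are None.

def densify_sample(rows):
    """Fill missing entries from the most recent reference block that still covers them."""
    dense = []
    active_block = None  # last reference block seen so far
    for pos, entry in rows:
        if entry is not None:
            # Remember reference blocks so later positions can borrow from them.
            if "END" in entry:
                active_block = entry
            dense.append((pos, entry))
        elif active_block is not None and pos <= active_block["END"]:
            # Missing entry covered by the block: materialize a real entry.
            dense.append((pos, {"GT": active_block["GT"], "from_block": True}))
        else:
            # Missing entry with no covering block stays missing.
            dense.append((pos, None))
    return dense


rows = [
    (100, {"GT": "0/0", "END": 104}),  # reference block covering 100-104
    (102, None),                       # covered by the block
    (104, None),                       # covered by the block
    (105, {"GT": "0/1"}),              # variant call
    (107, None),                       # past the block's END, stays missing
]
dense = densify_sample(rows)
for pos, entry in dense:
    print(pos, entry)
```

If that model is right, every missing cell that a block covers becomes a stored entry after densifying, which may be part of why the dense table is so much more expensive to count or write, though I have not confirmed that.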