Hello Hail team!
I am attempting to read in a matrix table, perform a densify operation, filter to a subset of the data, and then mt.write(). The filter operations complete quickly, as I'd expect, but the mt.write() operation has been running for over an hour and is still nowhere near done.
The original mt is a huge amount of data: count() returns (2736925182, 7783). Before the write operation, however, we filter it down to just chr22 for ~500 of these individuals.
Is there any way to speed up the code?
import hail as hl
hl.init(default_reference='GRCh38')
import argparse
import numpy as np
import pandas as pd
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()
mt = hl.read_matrix_table('gs://filepath.sparse.mt/')
mt = hl.experimental.densify(mt)
# filter to chr22
intervals = ['chr22']
mt = hl.filter_intervals(mt, [hl.parse_locus_interval(x, reference_genome='GRCh38') for x in intervals])
# filter to test individuals
ppmi_ids = df[0]  # df is a pandas DataFrame of sample IDs loaded earlier (not shown)
samples_to_keep = set(ppmi_ids)
set_to_keep = hl.literal(samples_to_keep)
test_data = mt.filter_cols(set_to_keep.contains(mt.meta.external_id))
# save mt densified and filtered to chr22/PPMI
test_data.write('gs://dataproc-staging-us-east1-942231253036-bw4veo0a/test_start.mt/', overwrite=True)
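For reference, one reordering I considered (but have not verified is correct for sparse data) is applying the interval and column filters to the sparse mt first, so that densify only has to materialize chr22 for the ~500 kept samples rather than the whole dataset. This is just a sketch using the same paths and fields as above:

```python
import hail as hl

hl.init(default_reference='GRCh38')

mt = hl.read_matrix_table('gs://filepath.sparse.mt/')

# interval filter on the sparse, locus-keyed mt, before densifying
mt = hl.filter_intervals(
    mt, [hl.parse_locus_interval('chr22', reference_genome='GRCh38')])

# column filter, also before densifying
# (ppmi_ids comes from df[0] as above; df is loaded elsewhere)
set_to_keep = hl.literal(set(ppmi_ids))
mt = mt.filter_cols(set_to_keep.contains(mt.meta.external_id))

# densify only the small remaining slice, then write
mt = hl.experimental.densify(mt)
mt.write('gs://dataproc-staging-us-east1-942231253036-bw4veo0a/test_start.mt/',
         overwrite=True)
```

Would this ordering be safe here, or does densify need to see the full sparse mt to fill in reference blocks correctly?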