Very slow write and count operations after densify

Hello Hail team!

I am attempting to read in a matrix table, perform a densify operation, then filter to a subset of the data and mt.write(). The filter operations are being performed quickly, as I’d expect, but the mt.write() operation has been running for over an hour and is still nowhere near done.

The original mt is huge: count() returns (2736925182, 7783). Before the write operation, however, we filter it down to just chr22 for ~500 of these individuals.
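
For a sense of scale, here is a rough back-of-the-envelope sketch of the sizes involved. The row and column counts are the ones count() reported above; the 500 is only my approximate number of test individuals, and I don't know the chr22 row count, so this covers the column side only.

```python
# Sizes reported by mt.count() on the original matrix table.
n_rows = 2736925182   # variants / reference-block rows
n_cols = 7783         # samples

# Approximate number of individuals kept after the sample filter (assumption: ~500).
n_cols_kept = 500

# Number of (row, column) cells in the full matrix and after the sample filter alone.
total_entries = n_rows * n_cols
entries_after_col_filter = n_rows * n_cols_kept

# Fraction of columns that survive the sample filter.
col_fraction = n_cols_kept / n_cols

print(f"entries in full matrix:       {total_entries:.3e}")
print(f"entries after sample filter:  {entries_after_col_filter:.3e}")
print(f"fraction of columns kept:     {col_fraction:.3f}")
```

So even ignoring the chr22 filter, the full matrix has on the order of 10^13 cells, and the sample filter alone keeps only about 6% of the columns.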

Is there any way to speed up the code?

import hail as hl
hl.init(default_reference='GRCh38')
import argparse

import numpy as np
import pandas as pd
from bokeh.io import show, output_notebook
from bokeh.layouts import gridplot
output_notebook()

mt = hl.read_matrix_table('gs://filepath.sparse.mt/')
mt = hl.experimental.densify(mt)

# filter to chr22

intervals = ['chr22']
mt = hl.filter_intervals(mt, [hl.parse_locus_interval(x, reference_genome='GRCh38') for x in intervals])

# filter to test individuals
# (df is defined outside this snippet; column 0 holds the PPMI sample IDs)
ppmi_ids = df[0]
samples_to_keep = set(ppmi_ids)
set_to_keep = hl.literal(samples_to_keep)
test_data = mt.filter_cols(set_to_keep.contains(mt.meta.external_id))

# Save mt densified and filtered to chr22/PPMI

test_data.write('gs://dataproc-staging-us-east1-942231253036-bw4veo0a/test_start.mt/', overwrite=True)