Spark memory error trying to write a MatrixTable

Hi Tim et al., I think I've tried what you're suggesting: annotate the columns of a MatrixTable with one chunk of phenotypes at a time, run a set of regressions, then re-load the MatrixTable and annotate it with the next chunk, and so on. The main issue now is that it's extremely slow. Any suggestions?
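At a high level, the loop looks like the sketch below (names like batch_paths, annotate_with_batch, and run_regression are simplified placeholders standing in for the real code further down):

import hail as hl

mt_path = 'burden.mt'  # placeholder path
mt = hl.read_matrix_table(mt_path)

for path in batch_paths:  # placeholder: one long-format file per ~40 phenotypes
    ht = hl.import_table(path, impute=True)  # rows of (sample_id, phecode, ...)
    mt = annotate_with_batch(mt, ht)         # the pivot chunk below
    for phenotype in batch_phenotypes(ht):   # placeholder helper
        run_regression(mt, phenotype)        # the regression call below
    mt = hl.read_matrix_table(mt_path)       # re-load before the next batch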

One thing that came up previously is that there's no quick way to run logistic regression on phenotypes that all have different missingness patterns (see this post), so that's why I'm currently running them one by one.
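(For reference, the batched form I'd like to use looks something like the sketch below. As I understand it, when y is a list, hl.logistic_regression_rows only keeps samples for which every response variable and covariate is defined, which doesn't work when the missingness patterns differ.)

# Sketch of the batched call I can't use; phenotype_batch is a
# placeholder list of phecode field names. With y as a list, samples
# missing any one of the phenotypes are dropped from all regressions.
regression_results = hl.logistic_regression_rows(
    test='firth',
    y=[mt_burden[p].has_phenotype for p in phenotype_batch],
    x=mt_burden.n_variants,
    covariates=[1.0, mt_burden.sex_float])  # covariates abbreviated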

Here’s the (somewhat simplified) code:

This chunk (thanks to Dan from this post) runs once for every batch of ~40 phenotypes (I use hl.import_table() to import ~40 phenotypes at a time):

# Collect the set of distinct phecodes in this batch (~40 of them).
phecodes = ht.aggregate(hl.agg.collect_as_set(ht.phecode))

# Pivot from long format (one row per sample/phecode pair) to one row
# per sample, with a dict mapping each phecode to its row of data.
ht = ht.group_by(
    ht.sample_id
).aggregate(
    phenos=hl.dict(hl.agg.collect((ht.phecode, ht.row)))
)

# Flatten the dict into one top-level field per phecode.
ht = ht.annotate(**{
    phecode: ht.phenos.get(phecode)
    for phecode in phecodes
})

# Annotate the MatrixTable columns with this batch of phenotypes.
mt = mt.annotate_cols(**ht[mt.col_key])

Then, for each of the ~40 phenotypes in the batch, I run the following code:

regression_results = hl.logistic_regression_rows(
    test='firth',
    y=mt_burden[phenotype].has_phenotype,  # case/control status for this phecode
    x=mt_burden.n_variants,                # burden score: number of variants
    covariates=[1.0,                       # intercept
                mt_burden.sex_float,
                mt_burden.p21003_i0_squared,  # age squared (UKB field 21003)
                mt_burden.p21003_i0_float,    # age
                mt_burden.p22009_a1,          # genetic PCs 1-10 (UKB field 22009)
                mt_burden.p22009_a2,
                mt_burden.p22009_a3,
                mt_burden.p22009_a4,
                mt_burden.p22009_a5,
                mt_burden.p22009_a6,
                mt_burden.p22009_a7,
                mt_burden.p22009_a8,
                mt_burden.p22009_a9,
                mt_burden.p22009_a10])
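
For completeness, after each regression I pull the results into a per-phenotype file, roughly like this (the output path is a placeholder):

# Tag the results table with the phenotype and export it.
regression_results = regression_results.annotate(phenotype=phenotype)
regression_results.export(f'results_{phenotype}.tsv')  # placeholder path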

Best,
Jeremy