Spark memory error trying to write a MatrixTable

Hi Tim et al., I think I've tried what you're suggesting: annotate the columns of a MatrixTable with one chunk of phenotypes at a time, run a set of regressions, then re-load the MatrixTable and annotate it with the next chunk, and so on. The main issue now is that it's extremely slow. Any suggestions?
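At a high level, the loop looks like the sketch below (names like batch_paths, annotate_with_batch, and run_regression are simplified placeholders standing in for the real code further down):

import hail as hl

mt_path = 'burden.mt'  # placeholder path
mt = hl.read_matrix_table(mt_path)

for path in batch_paths:  # placeholder: one long-format file per ~40 phenotypes
    ht = hl.import_table(path, impute=True)  # rows of (sample_id, phecode, ...)
    mt = annotate_with_batch(mt, ht)         # the pivot chunk below
    for phenotype in batch_phenotypes(ht):   # placeholder helper
        run_regression(mt, phenotype)        # the regression call below
    mt = hl.read_matrix_table(mt_path)       # re-load before the next batch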

One thing that came up previously is that there's no quick way to run logistic regression on phenotypes that all have different missingness patterns (see this post), so that's why I'm currently running them one by one.
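(For reference, the batched form I'd like to use looks something like the sketch below. As I understand it, when y is a list, hl.logistic_regression_rows only keeps samples for which every response variable and covariate is defined, which doesn't work when the missingness patterns differ.)

# Sketch of the batched call I can't use; phenotype_batch is a
# placeholder list of phecode field names. With y as a list, samples
# missing any one of the phenotypes are dropped from all regressions.
regression_results = hl.logistic_regression_rows(
    test='firth',
    y=[mt_burden[p].has_phenotype for p in phenotype_batch],
    x=mt_burden.n_variants,
    covariates=[1.0, mt_burden.sex_float])  # covariates abbreviated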

Here’s the (somewhat simplified) code:

This chunk (thanks to Dan from this post) runs once for every batch of ~40 phenotypes (I use hl.import_table() to import ~40 phenotypes at a time):

# Collect the set of distinct phecodes in this batch (~40 of them).
phecodes = ht.aggregate(hl.agg.collect_as_set(ht.phecode))

# Pivot from long format (one row per sample/phecode pair) to one row
# per sample, with a dict mapping each phecode to its row of data.
ht = ht.group_by(
    ht.sample_id
).aggregate(
    phenos=hl.dict(hl.agg.collect((ht.phecode, ht.row)))
)

# Flatten the dict into one top-level field per phecode.
ht = ht.annotate(**{
    phecode: ht.phenos.get(phecode)
    for phecode in phecodes
})

# Annotate the MatrixTable columns with this batch of phenotypes.
mt = mt.annotate_cols(**ht[mt.col_key])

Then, for each of the ~40 phenotypes in the batch, I run the following code:

regression_results = hl.logistic_regression_rows(
    test='firth',
    y=mt_burden[phenotype].has_phenotype,  # case/control status for this phecode
    x=mt_burden.n_variants,                # burden score: number of variants
    covariates=[1.0,                       # intercept
                mt_burden.sex_float,
                mt_burden.p21003_i0_squared,  # age squared (UKB field 21003)
                mt_burden.p21003_i0_float,    # age
                mt_burden.p22009_a1,          # genetic PCs 1-10 (UKB field 22009)
                mt_burden.p22009_a2,
                mt_burden.p22009_a3,
                mt_burden.p22009_a4,
                mt_burden.p22009_a5,
                mt_burden.p22009_a6,
                mt_burden.p22009_a7,
                mt_burden.p22009_a8,
                mt_burden.p22009_a9,
                mt_burden.p22009_a10])
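
For completeness, after each regression I pull the results into a per-phenotype file, roughly like this (the output path is a placeholder):

# Tag the results table with the phenotype and export it.
regression_results = regression_results.annotate(phenotype=phenotype)
regression_results.export(f'results_{phenotype}.tsv')  # placeholder path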

Best,
Jeremy