Help with "FatalError: MethodTooLargeException" and tables or matrices with many columns

As you’ve discovered, the column data on a MatrixTable is not distributed. We have plans to scale up to arbitrary amounts of column data, but that’s not ready yet.

If you need to do preprocessing on the phenotypes, I recommend doing that on an MT of the phenotypes, or locally with Pandas or NumPy; a sketch of the local route follows.
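This is a minimal, hypothetical sketch of the Pandas route; the file name and the `s`, `pheno_name`, and `pheno_value` columns are assumptions about your data layout, not something Hail requires:

import hail as hl
import pandas as pd

# Clean the phenotypes locally, e.g. drop rows with missing values.
# Assumed layout: one row per (sample, phenotype) pair.
df = pd.read_csv('phenotypes.csv')
df = df.dropna(subset=['pheno_value'])

# Move the cleaned data into Hail: a Table keyed by sample ID, then a
# MatrixTable with samples as rows and phenotypes as columns.
ht = hl.Table.from_pandas(df, key='s')
quant_pheno_mt = ht.to_matrix_table(row_key=['s'], col_key=['pheno_name'])

Once you’ve got case/control status or numbers for each phenotype, I’d convert the quantitative and case-control phenotypes into arrays: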

# Collect each sample's phenotype values into a single array field.
quant_pheno_mt = quant_pheno_mt.annotate_rows(
    quant_phenos=hl.agg.collect(quant_pheno_mt.pheno_value))
case_control_pheno_mt = case_control_pheno_mt.annotate_rows(
    case_control_phenos=hl.agg.collect(case_control_pheno_mt.pheno_value))

and annotate them on the genotype MT:

mt = mt.annotate_cols(
    quant_phenos=quant_pheno_mt.rows()[mt.s].quant_phenos,
    case_control_phenos=case_control_pheno_mt.rows()[mt.s].case_control_phenos)

You can pass linear_regression_rows an array of phenotypes:

gwas_results = hl.linear_regression_rows(
    y=mt.quant_phenos,
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0, mt.PC0, ...])
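For the case/control array, hl.logistic_regression_rows is the analogue. A minimal sketch, assuming the statuses are coded as 0/1 and pulling a single phenotype out of the array by index (whether y accepts the whole array directly depends on your Hail version):

log_results = hl.logistic_regression_rows(
    test='wald',
    y=mt.case_control_phenos[0],  # hypothetical: the first case/control phenotype
    x=mt.GT.n_alt_alleles(),
    covariates=[1.0, mt.PC0])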

I do not recommend adding a new column field for every phenotype. That is the natural thing to do, but each field incurs overhead; treating the phenotypes as a single array field lets Hail use a more efficient representation.
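One consequence of the array representation: when you regress on multiple phenotypes at once, the output fields of linear_regression_rows (beta, standard_error, p_value, and so on) come back as arrays in the same order, so you index them to get a single phenotype's results. For example, assuming the phenotype of interest was collected first:

pheno0_hits = gwas_results.filter(gwas_results.p_value[0] < 5e-8)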


I’m sorry you’re running into this! It’s definitely a rough part of Hail that we’re working hard to fix. Congratulations on being on the bleeding edge of Hail :wink:.
