Gene-based GWAS

Hi!
Here I’m trying to perform gene-based GWAS, but after obtaining aggregated information entries fields was excluded, and I guess there should be a less complicated approach to do it.

mt.annotate_rows(\
            non_miss = hl.agg.group_by(mt.pheno.cases_controls, hl.agg.count_where(hl.is_defined(mt.GT))),
            hom_ref = hl.agg.group_by(mt.pheno.cases_controls, hl.agg.count_where(mt.GT.is_hom_ref())),
            hom_var = hl.agg.group_by(mt.pheno.cases_controls, hl.agg.count_where(mt.GT.is_hom_var())),
            het = hl.agg.group_by(mt.pheno.cases_controls, hl.agg.count_where(mt.GT.is_het()))
                )

mt_gene_based = mt.group_rows_by(gene = mt.kept_transcripts[3])\ # gene field from CSQ field
                        .aggregate_rows(non_miss_gene_cases = hl.agg.sum(mt.non_miss['case']),
                                        non_miss_gene_controls = hl.agg.sum(mt.non_miss['control']),
                                        hom_ref_gene_cases = hl.agg.sum(mt.hom_ref['case']),
                                        hom_ref_gene_controls = hl.agg.sum(mt.hom_ref['control']),
                                        hom_var_gene_cases = hl.agg.sum(mt.hom_var['case']),
                                        hom_var_gene_controls = hl.agg.sum(mt.hom_var['control']),
                                        het_gene_cases = hl.agg.sum(mt.het['case']),
                                        het_gene_controls = hl.agg.sum(mt.het['control']),
                                       ).result()

p.s. there may be a more compact way to do the same.

a group_rows_by() / aggregate_rows / result is a somewhat degenerate operation on a matrix table – it ignores the entries, as you point out.

You can do the following:

mt_gene_based = mt.group_rows_by(gene = ...).aggregate(
    non_miss = hl.agg.count_where(hl.is_defined(mt.GT)),
    hom_ref = hl.agg.count_where(mt.GT.is_hom_ref()),
    het = hl.agg.count_where(mt.GT.is_het()),
    hom_var = hl.agg.count_where(mt.GT.is_hom_var()),
)

Now you’ve got a gene x sample matrix table, and you can do a hl.{linear, logistic, poisson}_regression_rows as normal.

Thanks! Now all is fine.