We’ve added the method logreg_burden to provide a very general framework for aggregating (collapsing) genotypes and then applying logistic regression (Wald, LRT, Firth, or Score). This method returns both a key table of logistic regression results and a key table of collapsed sample scores for those samples with non-missing phenotype and covariates. The latter can then be further analyzed in PySpark or exported for local analysis in tools like Python and R. Click the method link above for more details.
The performance is very similar to that of linreg_burden since there are typically far fewer keys for regression than variants for aggregation; it takes just a few minutes to run logistic gene burden starting from the annotated 2353 whole genomes in the 1000 Genomes Project.
Regressions in Hail 0.2 are generic, so there are no longer special commands for burden tests. Rather, to do logistic gene burden, annotate and group variants by gene to form a gene-by-sample matrix of scores, and then apply logistic regression per gene. Here’s an example that uses the total number of minor alleles as a sample’s gene score (for a biallelic dataset with random phenotype and covariates):
Thanks for your quick reply. Could I aggregate by variant type (e.g. silent, damaging, etc…) and perform a logistic regression based on this same example? It looks like it is possible to adapt it for performing a more global burden test using log_regression.
Absolutely. Once you’ve annotated variants with their type, you can use group_rows_by and aggregate to create a type-by-sample matrix table (for example, the entries could count the number of variants of that type per sample) and then run logistic regression per variant type.