Added Firth logistic regression

Hail’s logreg command now supports four versions of logistic regression for single-variant association analysis:

  • Wald test
  • Likelihood ratio test
  • Rao score test
  • Firth test

See Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants for an empirical comparison of these tests. Theoretical foundations of the Wald, likelihood ratio, and score tests may be found in Chapter 3 of Gesine Reinert’s notes on Statistical Theory. The Firth test reduces bias from small counts and resolves (quasi-)separation issues by penalizing maximum likelihood estimation by the Jeffrey’s invariant prior, as explained by Firth in Bias reduction of maximum likelihood estimates, 1993 and elaborated on by Heinze and Schemper in A solution to the problem of separation in logistic regression, 2002.

Out of curiosity, I used Hail to do a quick empirical comparison of test results and performance. I started from profile225.vds (an interval of data from the 1000 Genomes Project dataset with 2525 samples and 225k variants) and created a new version profile225.prelogreg.vds with a random binary phenotype, two random normally-distributed covariates, and hard calls only (i.e. just the gt field).

from hail import *
hc = HailContext()
    .annotate_samples_expr('sa.pheno = pcoin(0.5), sa.cov1 = rnorm(0,1), sa.cov2 = rnorm(0,1)')

I then annotated and exported variants with the results of variant QC, linear regression, and all four logistic regression tests. Note that these tests are implemented via lazy Spark transformations (namely, map on variant annotations), so they only actually run when export_variants necessitates action. In particular, with the chained command below, each core processes each of its variants though all five tests before moving onto the its next variant.

	.linreg('sa.pheno', ['sa.cov1', 'sa.cov2'])
	.logreg('wald', 'sa.pheno', ['sa.cov1', 'sa.cov2'], root='va.wald')
	.logreg('lrt', 'sa.pheno', ['sa.cov1', 'sa.cov2'], root='va.lrt')
	.logreg('score', 'sa.pheno', ['sa.cov1', 'sa.cov2'], root='va.score')
	.logreg('firth', 'sa.pheno', ['sa.cov1', 'sa.cov2'], root='va.firth')
	.export_variants('/Users/jbloom/data/logreg.tsv', 'Variant = v, va.qc.*, linBeta = va.linreg.beta, waldBeta = va.wald.beta, lrtBeta = va.lrt.beta, firthBeta = va.firth.beta, linPval = va.linreg.pval, waldPval = va.wald.pval, lrtPval = va.lrt.pval, firthPval = va.firth.pval, scorePval = va.score.pval, waldIter =, lrtIter =, firthIter ='))

I then read logreg.tsv into a Pandas dataframe in a Jupyter notebook to explore the results.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn

%matplotlib inline

df = pd.read_table('logreg.tsv')

An aside: rather than exporting to TSV, one could also go from a VariantDataset to a KeyTable with makeKeyTable and then to a Spark or Pandas DataFrame with to_dataframe or to_pandas.

Looking at beta for all variants for Firth and LRT, we see that Firth resolves quasi-separation issues and regresses beta toward the zero:

Looking at p-values for all variants, note that Firth is more conservative than LRT:

Restricting to variants with at least 20 heterozygous samples, the differences go away:


Next let’s compare the number of iterations until convergence. Note that LRT is bimodal due to quasi-separation, whereas Firth is not. When well-posed, Firth takes more iterations than LRT to converge, as expected:




We see clear evidence of mis-calibration for small counts:

Linear regression betas are extremely regressed toward zero:

However linear p-values are nearly indistinguishable to those of the score test. Here are QQ-plots comparing all four logistic methods for all variants, and for those variants with greater than 20 hets:

logreg qqplot

logreg qqplot 20het

Finally I looked at performance on my 2015 MacBook Pro. Since these tests are transformations, I added the action count to force them to run while adding negligible time, and included the command count(genotypes=True) in front to burn in an initial read of the data from disk so as to not bias against the first timed command.'profile225.prelogreg.vds').count(genotypes=True)

%time'profile225.prelogreg.vds').count(genotypes=True) # 4.17 s
%time'profile225.prelogreg.vds').count() # 1.84 s
%time'profile225.prelogreg.vds').linreg('sa.pheno', ['sa.cov1', 'sa.cov2']).count() # 7.1 s
%time'profile225.prelogreg.vds').logreg('score', 'sa.pheno', ['sa.cov1', 'sa.cov2']).count() # 51 s
%time'profile225.prelogreg.vds').logreg('lrt', 'sa.pheno', ['sa.cov1', 'sa.cov2']).count() # 92 s
%time'profile225.prelogreg.vds').logreg('wald', 'sa.pheno', ['sa.cov1', 'sa.cov2']).count() # 93 s
%time'profile225.prelogreg.vds').logreg('firth', 'sa.pheno', ['sa.cov1', 'sa.cov2']).count() # 300 s

logreg performance

No suprises here:

  • Linear regression is fastest by far.
  • Score is next, as there is no need to iteratively fit the model per variant.
  • Wald and LRT are next, as the full model must be fit per variant and convergence is quadratic.
  • Firth is slowest at 5 minutes, as both the null and full model must be fit per variant and convergence is linear.

We’ve made some additional performance improvements on linear and logistic regression since the plot above. Here is a comparison: