HWE p-values differ from Plink


I’ve been comparing statistics gathered from plink and I noticed that many variants did not match up. I can attribute some of this to bilallelic vs. >2 alt alleles. But when I look at only biallelic sites, there are many sites that plink has a pHWE = 1, while the pHWE calculated using Hail is much different (and varies significantly). Any idea why this is happening?


Hi Rebecca, thanks for writing! Could you help us reproduce this on our end? In particular, what are the Plink and Hail scripts you are running? And is there an example on public or synthetic data showing the discrepancies that we can investigate?

pHWE = 1 usually indicates an edge case involving zeroes, so it’s certainly possible we handle these differently. We fixed an issue with the Fisher Exact Test last week whereby we had forgotten to check whether either column in the 2 x 2 array was all zero (we were only checking the rows). In that case, Hail now just returns 1.

Here’s the hail script I used (sorry, it wouldn’t let me attach it as a file):

vds = vds.split_multi()
vds = vds.annotate_samples_table('/work-zfs/darking1/active/projects/BAZ2B/SureSelect/Sequencing/variant.filtering/targetedregions/allsamples.pheno.updated.txt',
vds_vCR = (vds.sample_qc())
vds_vCR.export_samples('filteredsampleqc.txt', 'Sample = s.id, sa.qc.*')
vds_vCR = (vds.sample_qc())
vds_gAB_sCR_sGQ = (vds
    .annotate_samples_vds(vds_vCR, code = 'sa.qc = vds.qc' )
    .filter_samples_expr('sa.qc.gqMean >= 20'))

##variant QC

vds_gAB_sCR_sGQ = vds_gAB_sCR_sGQ.variant_qc()
variantqc_table = vds_gAB_sCR_sGQ.variants_keytable().to_pandas()
vds_gAD_sCR_sGQ_vGQ= vds_gAB_sCR_sGQ.filter_variants_expr('va.qc.gqMean >= 20')	
vds_gAD_sCR_sGQ_vGQ.export_variants('filteredvariants.2.txt', 'Variant=va.rsid, Chr=v.contig, Start=v.start, Ref=v.ref, Alt.alleles=v.altAlleles, N.alleles=v.nAltAlleles(), va.qc.*')

Plink was simply:

plink --vcf ../ALL.pos.recode.vcf --keep hailsamples.txt --hardy --make-bed --out all.hail

I used the same samples and same variants for both programs. I also didn’t filter genotypes based on allelic balance, in order to be as consistent as possible between the two programs.

I’m attaching one plot, one showing the comparison of pHWE from Hail vs. pHWE from plink restricted to only biallelic variants. (I also have the same plot for all variants, but the forum will only allow me to upload one plot).

I can also send you the summary files from hail and plink if that would help at all. I’m doing this on ~1600 samples.

That’s a great plot! Based on it, I’m confident the difference is due to how Plink and Hail calculate pHWE.

Our documentation of the pHWE function includes this write up describing the math of our implementation with references. But we do not make clear that we indeed use mid-p correction on the exact test, which Plink does not. I’ll update the docs accordingly.

With this correction, the right, left, and exact tests sum half the probability of the observed outcome and the probabilities of “more extreme” outcomes (greater value, lesser value, or smaller probability, respectively). In the exact test, outcomes with probability equal to the observed outcome are also weighted by 1/2. We found this to be a more sensible procedure for p-values on this discrete null distribution.

While the difference between 1 and .5 is large, for small p-values the results should be very similar. This convergence in apparent in the plot above, and you could investigate further with a log-log plot zooming in near (0,0) above. If you make that plot (and there are actually points down there), it’d be awesome if you could post it here as well.

Yup. I realized today that plink has a ‘midp’ modifier that specifies a mid-p adjustment. When I did that (and removed any variant with >2 alleles), the p-values were perfected correlated. See attached plot.

Thanks for your help! Looking forward to using Hail.