Logistic regression implementation


I am new to Hail and am trying to run a logistic regression on a binary trait.

My setup is:
1. The phenotype is in 1, 0 format.
2. I imported the PLINK file using hl.import_plink with quant_pheno=False and assigned the result to mt.
3. I imported the covariates using hl.import_table and used annotate_cols to add them to mt.
4. Last, I did something like:
gwas = hl.logistic_regression_rows(test='wald',
    y=[mt.is_case, mt.is_case],
    covariates=[1, mt.covar.1, mt.covar.2, …])

1. Why is my is_case in bool format?
[Screenshot: the is_case column]

2. Can you also help me understand this error? Error summary: HailException: For logistic regression, y at index 0 must be non-constant

3. I am not sure if I am implementing logistic regression correctly. Can you let me know if there is anything I should change?

Thank you!

Hey @fengyi, I’m sorry you’re having trouble!

In the future, if you share the exact code you ran, we can more easily help you. It is also easier to help you when you copy and paste the exact output you get (like the is_case column).

As to your questions:

  1. You specified that your trait is not a quantitative phenotype. A binary (aka case-control) phenotype is most naturally represented by a Boolean value. Hail is designed to work properly with Boolean values.
  2. This error message indicates that all samples in your dataset have the same value for the first element of the y array. In other words, all your samples have the same value for is_case. This is clear from the is_case column you shared. All of your samples either have missing data or are controls. It is incorrect to encode your phenotype as 1s and 0s in a PLINK fam file. In a PLINK fam file, 1 always indicates a control, 2 always indicates a case, and -9, 0, and non-numeric values all indicate missing data. You can fix your fam file by importing it as a quantitative phenotype and then manually converting the 1s and 0s to Booleans:
mt = hl.import_plink(..., quant_pheno=True)  # read the phenotype as a number
mt = mt.annotate_cols(is_case = hl.bool(mt.quant_pheno))  # nonzero -> True, 0 -> False
mt = mt.drop('quant_pheno')
  3. You should specify just one response variable: y=mt.is_case. When y is a list, Hail fits a separate regression for each entry, so y=[mt.is_case, mt.is_case] only runs the same test twice. Samples with a missing phenotype are left out of the fit, so they do not need a response variable of their own.
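To make point 2 concrete, here is a small plain-Python sketch (no Hail needed) of the PLINK case/control convention described above. The helper name decode_fam_phenotype and the sample values are made up for illustration; this is not Hail's or PLINK's actual parser. It shows why a phenotype column coded as 1s and 0s ends up with only controls and missing values, which is what triggers the "must be non-constant" error.

```python
from typing import Optional


def decode_fam_phenotype(token: str) -> Optional[bool]:
    """Decode one case/control phenotype field using the PLINK convention:
    2 = case, 1 = control, anything else (0, -9, non-numeric) = missing."""
    token = token.strip()
    if token == "2":
        return True   # case
    if token == "1":
        return False  # control
    return None       # missing


# Hypothetical phenotype column that was coded as 1/0 instead of 2/1.
raw = ["1", "0", "1", "0", "0"]
decoded = [decode_fam_phenotype(t) for t in raw]

# Logistic regression needs at least two distinct non-missing values.
observed = {v for v in decoded if v is not None}

print(decoded)            # [False, None, False, None, None]
print(len(observed) > 1)  # False: only controls remain
```

Every 1 becomes a control and every 0 becomes missing, so the response has a single distinct value.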

Hi @danking, thank you for your help!

I imported it as a quantitative phenotype, and it seems like 0 was automatically set to -9.

Then I tried mt = mt.annotate_cols(is_case = hl.bool(mt.quant_pheno)):

[Screenshot: the is_case column after the conversion]

It seems like everything became true.

Do you know how to transform -9 to False and 1 to True?

Thank you!

You’re looking for hl.case:

mt = mt.annotate_cols(is_case = hl.case()
      .when(mt.quant_pheno == -9, False)
      .when(mt.quant_pheno == 1, True)
      .or_missing())  # any other value becomes missing
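For anyone wondering why the Boolean conversion made everything True: a rough plain-Python analogy, assuming the conversion treats any nonzero number as True (which matches what was observed in this thread). The recode function below is a hypothetical stand-in for the case expression, and the values are invented.

```python
from typing import Optional


def recode(quant_pheno: Optional[float]) -> Optional[bool]:
    """Explicit mapping, mirroring the case expression: -9 -> False, 1 -> True,
    anything else (including a missing value) -> missing."""
    if quant_pheno == -9:
        return False
    if quant_pheno == 1:
        return True
    return None


# Hypothetical phenotype values after the import described in this thread.
values = [-9.0, 1.0, 1.0, -9.0]

# Truthiness only asks "is it nonzero?", so -9 and 1 both come out True.
naive = [bool(v) for v in values]

# The explicit mapping separates the two groups.
recoded = [recode(v) for v in values]

print(naive)    # [True, True, True, True]
print(recoded)  # [False, True, True, False]
```

The explicit mapping is the safer choice whenever the numeric codes are not literally 0 and 1.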
