Logistic regression implementation

fengyi · September 23, 2020, 3:29am

Hi,

I am new to Hail and am trying to run logistic regression on a binary trait.

My setup is:
1, The phenotype is in 1/0 format
2, I imported the PLINK files using hl.import_plink with quant_pheno=False and assigned the result to mt
3, I imported the covariates using hl.import_table and used annotate_cols to add them to mt
4, Last, I ran something like:
gwas = hl.logistic_regression_rows(test='wald',
    y=[mt.is_case, mt.is_case],
    x=mt.GT.n_alt_alleles(),
    covariates=[1, mt.covar.1, mt.covar.2, …])

Questions:
1, Why is my is_case in bool format?
Screen Shot 2020-09-22 at 10.24.25 PM

2, Can you also help me to understand this error? Error summary: HailException: For logistic regression, y at index 0 must be non-constant

3, I am not sure if I am implementing logistic regression correctly. Can you let me know if there is anything I should change?

Thank you!
-Fengyi

danking · September 23, 2020, 2:16pm

Hey @fengyi, I’m sorry you’re having trouble!

In the future, if you share the exact code you ran, we can more easily help you. It is also easier to help you when you copy and paste the exact output you get (like the is_case column).

As to your questions:

1. You specified that your trait is not a quantitative phenotype. A binary (aka case-control) phenotype is most naturally represented by a Boolean value, and Hail is designed to work properly with Boolean values.
2. This error message indicates that all samples in your dataset have the same value for the first element of the y array. In other words, all your samples have the same value for is_case. This is clear from the is_case column you shared: every sample is either missing or a control. It is incorrect to encode your phenotype as 1s and 0s in a PLINK fam file. In a PLINK fam file, 1 always indicates a control, 2 always indicates a case, and -9, 0, and non-numeric values all indicate missing data. You can work around this by importing the phenotype as a quantitative phenotype and then manually converting the 1s and 0s to Booleans:

mt = hl.import_plink(..., quant_pheno=True)
mt = mt.annotate_cols(is_case = hl.bool(mt.quant_pheno))
mt = mt.drop('quant_pheno')

3. If you have no missing data, you should specify only one response variable, y=mt.is_case, because there is only one degree of freedom: case versus not-case. If you do have missing data, then you can use y=[mt.is_case, mt.is_case] because you have two degrees of freedom: case versus not-case and missing versus not-missing.
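
To make the fam encoding rules above concrete, here is a minimal plain-Python sketch (no Hail required) of how a case-control phenotype column is decoded, and why a column written as 1s and 0s ends up "constant". The helper names `decode_fam_phenotype` and `is_constant` and the sample values are hypothetical, chosen for illustration; this is not Hail's internal implementation.

```python
from typing import Iterable, Optional


def decode_fam_phenotype(value: str) -> Optional[bool]:
    """Decode a case-control phenotype from column 6 of a PLINK .fam file.

    Assumes the value is given exactly as the string written in the file.
    """
    if value == "2":
        return True   # 2 always means case
    if value == "1":
        return False  # 1 always means control
    return None       # 0, -9, and non-numeric values all mean missing


def is_constant(phenotypes: Iterable[Optional[bool]]) -> bool:
    """True when the non-missing phenotypes do not include both cases and controls."""
    observed = {p for p in phenotypes if p is not None}
    return len(observed) < 2


# Hypothetical phenotype column written as 1/0 instead of 2/1.
raw = ["1", "0", "1", "1", "0"]
decoded = [decode_fam_phenotype(v) for v in raw]

print(decoded)               # every sample is either a control or missing
print(is_constant(decoded))  # so the response has no variation to regress on
```

With this encoding every 1 is read as a control and every 0 as missing, so no sample is a case, which is the situation the "must be non-constant" error reports.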

fengyi · September 23, 2020, 3:50pm

Hi @danking, thank you for your help!

I have imported it as a quantitative phenotype, and it seems like 0 was automatically set to -9.

When I tried mt = mt.annotate_cols(is_case = hl.bool(mt.quant_pheno)), I got:

Screen Shot 2020-09-23 at 10.49.03 AM

It seems like everything became true.

Do you know how to transform -9 to False and 1 to True?

Thank you!
-Fengyi

danking · September 23, 2020, 4:31pm

You’re looking for hl.case:

mt = mt.annotate_cols(is_case =
    hl.case()
      .when(mt.quant_pheno == -9, False)
      .when(mt.quant_pheno == 1, True)
      .or_missing())
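
Putting the pieces of this thread together, the following is a hedged sketch of the recoding step followed by a sanity check and the regression call. The function name `recode_and_run` is hypothetical, `mt` is assumed to come from `hl.import_plink(..., quant_pheno=True)`, the -9/1 mapping is the one used in this thread, and the covariate list contains only an intercept because the actual covariate fields are not shown here.

```python
def recode_and_run(mt):
    """Recode mt.quant_pheno into a Boolean is_case, check it, and run a Wald test.

    `mt` is assumed to be a MatrixTable imported with quant_pheno=True.
    """
    # Imported inside the function so this sketch can be loaded without Hail installed.
    import hail as hl

    # Map -9 to control and 1 to case (the mapping from this thread);
    # any other value becomes missing.
    mt = mt.annotate_cols(
        is_case=hl.case()
        .when(mt.quant_pheno == -9, False)
        .when(mt.quant_pheno == 1, True)
        .or_missing()
    )

    # Count samples per phenotype value; missing values are counted under the key None.
    counts = mt.aggregate_cols(hl.agg.counter(mt.is_case))
    if not (counts.get(True, 0) > 0 and counts.get(False, 0) > 0):
        raise ValueError(
            f"is_case is constant among non-missing samples: {dict(counts)}"
        )

    # Single Boolean response; intercept-only covariates as a placeholder.
    return hl.logistic_regression_rows(
        test='wald',
        y=mt.is_case,
        x=mt.GT.n_alt_alleles(),
        covariates=[1.0],
    )
```

Checking the counts before calling `hl.logistic_regression_rows` surfaces an all-control or all-missing phenotype early, with the actual tallies, rather than through the "must be non-constant" error.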

