How do I include a categorical variable as a covariate in my logistic or linear regression?

How do I include a categorical variable like “smokes”, “smoked”, “never_smoked” as a covariate in a linear or logistic regression?

Great question! There’s some subtlety here around data representation and degrees of freedom; I encourage you to read the discussion at Cross Validated. For this post, I’ll assume you have one categorical covariate, which we will “one-hot encode”.

First, create some random data with the three smoking statuses, 50 people, 100 variants, and a random phenotype is_case.

mt = hl.balding_nichols_model(1, 50, 100)
smoking_possibilities = hl.literal([
    "previous_smoker", "current_smoker", "never_smoker"])
mt = mt.annotate_cols(
    smoking_status = smoking_possibilities[hl.rand_cat([1,1,1])],
    is_case = hl.rand_bool(0.5)
)

If you know all the possible categories, you can enter them here as hl.literal(["foo", "bar", ...]), or you can use the following expression to compute the possible categories from the data:

categories = mt.aggregate_cols(hl.agg.collect_as_set(mt.smoking_status))
# a set of the distinct statuses, e.g. {'current_smoker', 'never_smoker', 'previous_smoker'}

The following code defines a one-hot encoding for this categorical variable as a dictionary. We show it so that, as the pipeline runs, you can see an example of what a one-hot encoding looks like.

n_categories = len(categories)

# hl.int(i == j) is 1 if i == j is true and 0 otherwise
def one_hot_encode(i):
    return hl.range(n_categories).map(lambda j: hl.int(i == j))

one_hot_encoding = hl.literal(
    {category: one_hot_encode(index)
     for index, category in enumerate(categories)})
one_hot_encoding.show()
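The encoding itself will look something like this (which slot each category gets depends on how the set of categories happens to be enumerated, so treat this as illustrative):

{'current_smoker': [1, 0, 0], 'never_smoker': [0, 1, 0], 'previous_smoker': [0, 0, 1]}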

Finally, we one-hot encode the smoking status. Unfortunately, Hail cannot directly accept array fields as covariates. Instead, we explicitly break the array into its three components and supply each one as a covariate to logistic_regression_rows.

mt = mt.annotate_cols(
    one_hot_encoded_smoking_status = one_hot_encoding[mt.smoking_status]
)
mt = mt.annotate_cols(
    smoking_status_0 = mt.one_hot_encoded_smoking_status[0],
    smoking_status_1 = mt.one_hot_encoded_smoking_status[1],
    smoking_status_2 = mt.one_hot_encoded_smoking_status[2]
)
hl.logistic_regression_rows(
    test='wald',
    y=mt.is_case,
    x=mt.GT.n_alt_alleles(),
    covariates=[mt.smoking_status_0, mt.smoking_status_1, mt.smoking_status_2]
).show()
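The question also asks about linear regression: the same covariates work unchanged with linear_regression_rows. A minimal sketch; we coerce the Boolean is_case to a float purely for illustration, whereas a real continuous phenotype would be passed directly:

# Same one-hot covariates; is_case is coerced to a float only for illustration.
hl.linear_regression_rows(
    y=hl.float64(mt.is_case),
    x=mt.GT.n_alt_alleles(),
    covariates=[mt.smoking_status_0, mt.smoking_status_1, mt.smoking_status_2]
).show()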

One copy-pasteable block of code:

import hail as hl

mt = hl.balding_nichols_model(1, 50, 100)
smoking_possibilities = hl.literal([
    "previous_smoker", "current_smoker", "never_smoker"])
mt = mt.annotate_cols(
    smoking_status = smoking_possibilities[hl.rand_cat([1,1,1])],
    is_case = hl.rand_bool(0.5)
)

categories = mt.aggregate_cols(hl.agg.collect_as_set(mt.smoking_status))

n_categories = len(categories)

# hl.int(i == j) is 1 if i == j is true and 0 otherwise
def one_hot_encode(i):
    return hl.range(n_categories).map(lambda j: hl.int(i == j))

one_hot_encoding = hl.literal(
    {category: one_hot_encode(index)
     for index, category in enumerate(categories)})
one_hot_encoding.show()

mt = mt.annotate_cols(one_hot_encoded_smoking_status = one_hot_encoding[mt.smoking_status])
mt = mt.annotate_cols(
    smoking_status_0 = mt.one_hot_encoded_smoking_status[0],
    smoking_status_1 = mt.one_hot_encoded_smoking_status[1],
    smoking_status_2 = mt.one_hot_encoded_smoking_status[2]
)
mt.cols().show()
hl.logistic_regression_rows(
    test='wald',
    y=mt.is_case,
    x=mt.GT.n_alt_alleles(),
    covariates=[mt.smoking_status_0, mt.smoking_status_1, mt.smoking_status_2]).show()

If we have a categorical variable with 3 categories as in the example, wouldn’t we only need to define 2 one-hot variables as covariates to save us from linear dependency? Apologies if I am thinking about this wrong…

Hail’s regression models do not implicitly include an intercept term, so if we were to use only two variables:

$$\text{case} = \beta_1 x_1 + \beta_2 x_2$$

Then our model has no regression coefficient (beta) corresponding to one of the statuses. In particular, if $x_1$ is “previous smoker” and $x_2$ is “smoker”, then a “never smoker” has $x_1 = x_2 = 0$, so the linear predictor is identically zero: the model can never predict that a “never smoker” is a case. $x_3$ doesn’t appear in the model anywhere!

If you prefer to dummy-encode your variables, just add an intercept term:

covariates=[1.0, mt.currently_smokes, mt.previously_smoked]
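Concretely, reusing the smoking_status_* fields from the example above (a sketch; whichever indicator you drop becomes the reference level, absorbed into the intercept):

hl.logistic_regression_rows(
    test='wald',
    y=mt.is_case,
    x=mt.GT.n_alt_alleles(),
    # the dropped category (smoking_status_2 here) is the reference level
    covariates=[1.0, mt.smoking_status_0, mt.smoking_status_1]
).show()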

Things get more complicated when you have two or more categorical variables. In that case, the simple mechanical rule is: use an intercept and “dummy-encode” each categorical.

The intuition for not using multiple fully one-hot-encoded categorical variables is that every sample belongs to exactly one category of each variable. For example, every sample has exactly one of the three smoking statuses and exactly one of your sexes, so the smoking indicators sum to one for every sample, and so do the sex indicators. Both groups of columns add up to the same all-ones column: a linear dependence! The model can shift weight between the smoking betas and the sex betas without changing any prediction, so it has no reason to prefer one solution over another.
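You can check half of this directly with the fields defined above: every sample’s smoking indicators sum to exactly one. A second fully one-hot-encoded variable (hypothetical is_male/is_female fields, say) would sum to the same all-ones column, which is exactly the dependence:

# Every sample has exactly one smoking status, so its indicators sum to 1.
# A second one-hot-encoded variable's indicators would also sum to 1.
mt.aggregate_cols(hl.agg.all(
    mt.smoking_status_0 + mt.smoking_status_1 + mt.smoking_status_2 == 1
))  # True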

To be quite honest, I find interpreting a model with multiple independent categorical variables quite hard! I don’t actually know how to interpret the betas in this model, for example:

$$\text{case} = \beta_1 x_{\text{previous\_smoker}} + \beta_2 x_{\text{smoker}} + \beta_3 x_{\text{is\_male}} + \beta_4 \cdot 1$$

That makes sense, thanks so much! Your latter example is very similar to the issue that raised this question for me: I have 5 data collection sites in a categorical variable, as well as sex (and other float covariates). Sounds like the best strategy is to use the dummy-variables option: 4 of the 5 collection sites, is_male, and an intercept.
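In code, that would look something like this (site_0 through site_3 and is_male are hypothetical indicator fields):

covariates=[1.0, mt.site_0, mt.site_1, mt.site_2, mt.site_3, mt.is_male]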
