How do I include a categorical variable like “smokes”, “smoked”, “never_smoked” as a covariate in a linear or logistic regression?
Great question! There’s some subtlety here around data representation and degrees of freedom. I encourage you to read the discussion at cross validated. For this post, I’ll assume you have one categorical covariate. We will “one-hot encode” that categorical covariate.
First, create some random data with the three smoking statuses, 50 people, 100 variants, and a random phenotype is_case
.
mt = hl.balding_nichols_model(1, 50, 100)
smoking_possibilities = hl.literal([
"previous_smoker", "current_smoker", "never_smoker"])
mt = mt.annotate_cols(
smoking_status = smoking_possibilities[hl.rand_cat([1,1,1])],
is_case = hl.rand_bool(0.5)
)
If you know all the possible categories you can enter them here as hl.literal(["foo", "bar", ...])
, or you can use the following expression to compute the possible categories from the data:
categories = mt.aggregate_cols(hl.agg.collect_as_set(mt.smoking_status))
The following code defines a one-hot encoding for this categorical variable as a dictionary. We show
it so as this pipeline runs you can see an example of what one-hot encodings look like.
n_categories = len(categories)
def one_hot_encode(i):
return hl.range(n_categories).map(lambda j: hl.int(i == j))
one_hot_encode = hl.literal(
{category: one_hot_encode(index)
for index, category in enumerate(categories)})
one_hot_encode.show()
Finally, we one-hot encode the smoking status. Unfortunately, Hail cannot directly accept array fields in the covariates. Instead, we explicitly break this array into the three components. We supply each component as a covariate to logistic_regression_rows
.
mt = mt.annotate_cols(
one_hot_encoded_smoking_status = one_hot_encode[mt.smoking_status]
)
mt = mt.annotate_cols(
smoking_status_0 = mt.one_hot_encoded_smoking_status[0],
smoking_status_1 = mt.one_hot_encoded_smoking_status[1],
smoking_status_2 = mt.one_hot_encoded_smoking_status[2]
)
hl.logistic_regression_rows(
test='wald',
y=mt.is_case,
x=mt.GT.n_alt_alleles(),
covariates=[mt.smoking_status_0, mt.smoking_status_1, mt.smoking_status_2]
).show()
One copy-pasteable block of code:
mt = hl.balding_nichols_model(1, 50, 100)
smoking_possibilities = hl.literal([
"previous_smoker", "current_smoker", "never_smoker"])
mt = mt.annotate_cols(
smoking_status = smoking_possibilities[hl.rand_cat([1,1,1])],
is_case = hl.rand_bool(0.5)
)
categories = mt.aggregate_cols(hl.agg.collect_as_set(mt.smoking_status))
n_categories = len(categories)
# hl.int(i == j) is 1 if i == j is true and 0 otherwise
def one_hot_encode(i):
return hl.range(n_categories).map(lambda j: hl.int(i == j))
one_hot_encode = hl.literal(
{category: one_hot_encode(index)
for index, category in enumerate(categories)})
one_hot_encode.show()
mt = mt.annotate_cols(one_hot_encoded_smoking_status = one_hot_encode[mt.smoking_status])
mt = mt.annotate_cols(
smoking_status_0 = mt.one_hot_encoded_smoking_status[0],
smoking_status_1 = mt.one_hot_encoded_smoking_status[1],
smoking_status_2 = mt.one_hot_encoded_smoking_status[2]
)
mt.cols().show()
hl.logistic_regression_rows(
test='wald',
y=mt.is_case,
x=mt.GT.n_alt_alleles(),
covariates=[mt.smoking_status_0, mt.smoking_status_1, mt.smoking_status_2]).show()
If we have a categorical variable with 3 categories as in the example, wouldn’t we only need to define 2 one-hot variables as covariates to save us from linear dependency? Apologies if I am thinking about this wrong…
Hail’s regression models do not implicitly include an intercept term, so if we were to use only two variables:
Then our model has no regression variable (beta) corresponding to one of the statuses. In particular, if x_1 is “previous smoker” and x_2 is “smoker”, then our model can never predict that a “never smoker” is a case. x_3 doesn’t appear in the model anywhere!
If you prefer to dummy encode your variables, just add an intercept term:
covariates=[1.0, mt.currently_smokes, mt.previously_smoked]
Things get more complicated when you have two or more categorical variables. In that case, the simple mechanical rule is: use an intercept and “dummy-encode” each categorical.
The intuition for not using multiple one-hot-encoded categorical variables is that all your cases have at least one of each category of each variable. For example, every case has one of the three smoking statuses and has one of your sexes. If being male and smoking perfectly predicts case status, you have a linear dependence! Either your “is male” beta can be one or your “smoker” beta can be one, and the linear model has no reason to prefer one to the other.
To be quite honest, I find interpreting a model with multiple independent categorical variables quite hard! I don’t actually know how to interpret the betas in this model, for example:
That makes sense, thanks so much! Your latter example is very similar to the issue that raised this question for me - I have 5 data collection sites in a categorical variable as well as sex (and other float covariates). Sounds like the best strategy is to use the dummy variables option; 4 of the 5 collection sites, is_male, and an intercept