Hi, I have a covariate, RACE, that is a string (categorical) field. I know Hail doesn't currently support this type as a covariate, so what's the syntax to convert that column in my samples_table from string to numeric? Thanks.
Suppose the matrix table mt has a column field RACE of type String that can take three values: "EUR", "ASN", "AFR". Since Hail internally adds an intercept covariate (which takes the value 1 for every sample), you'll only want to add two covariates to account for RACE. Were you to add three covariates, the design matrix would be singular.
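To see why the third covariate is redundant, here is a minimal sketch in plain Python (no Hail involved). The sample values and the helper names dummy_columns and rank are made up for illustration. It builds the design-matrix columns by hand and checks their rank with exact arithmetic:

```python
from fractions import Fraction


def dummy_columns(values, categories):
    """Return one 0/1 indicator column per category in `categories`."""
    return [[1 if v == c else 0 for v in values] for c in categories]


def rank(columns):
    """Rank of the matrix with the given columns, via exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in zip(*columns)]
    r = 0
    for col in range(len(columns)):
        # Find a row at or below position r with a nonzero entry in this column.
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Clear this column in every other row.
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


# Hypothetical RACE values for six samples.
race = ["EUR", "ASN", "AFR", "EUR", "AFR", "ASN"]
intercept = [1] * len(race)

# Intercept plus one indicator per category: 4 columns.
three = [intercept] + dummy_columns(race, ["EUR", "ASN", "AFR"])
# Intercept plus indicators for all but one category: 3 columns.
two = [intercept] + dummy_columns(race, ["EUR", "ASN"])

rank_three = rank(three)
rank_two = rank(two)

print(f"three dummies: {len(three)} columns, rank {rank_three}")
print(f"two dummies:   {len(two)} columns, rank {rank_two}")
```

With three indicators, the indicator columns sum to the intercept column in every row, so the four columns have rank 3 and the design matrix is rank-deficient. With two indicators, the three columns are independent, and the omitted category ("AFR" here) acts as the reference level absorbed by the intercept.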
The simplest approach is a dummy encoding that encodes each category as a Boolean field. You could add these column fields to mt with annotate_cols if you'll reuse them elsewhere in your analyses. But if you only want to use them in a regression, you might as well just create them on the fly. In 0.2 syntax, this looks like:
mt = hl.linear_regression(y = ..., x = ..., covariates = [mt.RACE == 'EUR', mt.RACE == 'ASN'])