With Julia GLM for categorical variable how to select the reference level?

It appears the reference level is selected as the first unique element of the categorical value. However, in my case, the reference is P (and not A).

ols_all= lm( @formula( value ~ Treatment ), s_treat)

gives

value ~ 1 + Treatment

Coefficients:
────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)   19.0845    0.531803  35.89    <1e-99   18.039     20.13
Treatment: P   5.5775    0.752082   7.42    <1e-12    4.09895    7.05605

What I really want is Treatment: A (P is the placebo or the control group). Granted I could rename the values of the variables. But in SAS and R it is possible to select the reference, hence I am hoping there is a way to do it with Julia GLM as well.

Solution

GLM.jl does not take the first unique element of CategoricalVector, but the first level in this column as a refrence. Therefore if you reorder levels you can change the reference and also the order of appearance of levels in the output. Here is an example:

julia> using CategoricalArrays

julia> using DataFrames

julia> using GLM

julia> y = rand(10)
10-element Vector{Float64}:
 0.6680787249599323
 0.4405942175942186
 0.012595806803754828
 0.21742822324104805
 0.4588945549282415
 0.05463125900208077
 0.5889309471773907
 0.014531957298235865
 0.8444514228200215
 0.13148010370633267

julia> x = categorical(rand(["a", "b", "c"], 10))
10-element CategoricalArray{String,1,UInt32}:
 "b"
 "b"
 "a"
 "a"
 "c"
 "a"
 "c"
 "c"
 "a"
 "b"

julia> df = DataFrame(x=x, y=y)
10×2 DataFrame
 Row │ x     y
     │ Cat…  Float64
─────┼─────────────────
   1 │ b     0.668079
   2 │ b     0.440594
   3 │ a     0.0125958
   4 │ a     0.217428
   5 │ c     0.458895
   6 │ a     0.0546313
   7 │ c     0.588931
   8 │ c     0.014532
   9 │ a     0.844451
  10 │ b     0.13148

julia> lm(@formula(y~x), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)  0.282277     0.165972  1.70    0.1328  -0.110185   0.674739
x: b         0.131108     0.253527  0.52    0.6210  -0.468388   0.730603
x: c         0.0718425    0.253527  0.28    0.7851  -0.527653   0.671338
────────────────────────────────────────────────────────────────────────

julia> levels(df.x)
3-element Vector{String}:
 "a"
 "b"
 "c"

julia> levels!(df.x, ["c", "b", "a"])
10-element CategoricalArray{String,1,UInt32}:
 "b"
 "b"
 "a"
 "a"
 "c"
 "a"
 "c"
 "c"
 "a"
 "b"

julia> lm(@formula(y~x), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   0.354119     0.191648   1.85    0.1071  -0.0990568   0.807295
x: b          0.0592652    0.271031   0.22    0.8331  -0.581622    0.700153
x: a         -0.0718425    0.253527  -0.28    0.7851  -0.671338    0.527653
───────────────────────────────────────────────────────────────────────────

More advanced strategies are described here: https://juliastats.org/StatsModels.jl/stable/contrasts/.