Search code examples
juliaglm

With Julia GLM for categorical variable how to select the reference level?


It appears the reference level is selected as the first unique element of the categorical value. However, in my case, the reference is P (and not A).

ols_all= lm( @formula( value ~ Treatment ), s_treat)

gives

value ~ 1 + Treatment

Coefficients:
────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)   19.0845    0.531803  35.89    <1e-99   18.039     20.13
Treatment: P   5.5775    0.752082   7.42    <1e-12    4.09895    7.05605

What I really want is Treatment: A (P is the placebo or the control group). Granted I could rename the values of the variables. But in SAS and R it is possible to select the reference, hence I am hoping there is a way to do it with Julia GLM as well.


Solution

  • GLM.jl does not take the first unique element of CategoricalVector, but the first level in this column as a refrence. Therefore if you reorder levels you can change the reference and also the order of appearance of levels in the output. Here is an example:

    julia> using CategoricalArrays
    
    julia> using DataFrames
    
    julia> using GLM
    
    julia> y = rand(10)
    10-element Vector{Float64}:
     0.6680787249599323
     0.4405942175942186
     0.012595806803754828
     0.21742822324104805
     0.4588945549282415
     0.05463125900208077
     0.5889309471773907
     0.014531957298235865
     0.8444514228200215
     0.13148010370633267
    
    julia> x = categorical(rand(["a", "b", "c"], 10))
    10-element CategoricalArray{String,1,UInt32}:
     "b"
     "b"
     "a"
     "a"
     "c"
     "a"
     "c"
     "c"
     "a"
     "b"
    
    julia> df = DataFrame(x=x, y=y)
    10×2 DataFrame
     Row │ x     y
         │ Cat…  Float64
    ─────┼─────────────────
       1 │ b     0.668079
       2 │ b     0.440594
       3 │ a     0.0125958
       4 │ a     0.217428
       5 │ c     0.458895
       6 │ a     0.0546313
       7 │ c     0.588931
       8 │ c     0.014532
       9 │ a     0.844451
      10 │ b     0.13148
    
    julia> lm(@formula(y~x), df)
    StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
    
    y ~ 1 + x
    
    Coefficients:
    ────────────────────────────────────────────────────────────────────────
                     Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
    ────────────────────────────────────────────────────────────────────────
    (Intercept)  0.282277     0.165972  1.70    0.1328  -0.110185   0.674739
    x: b         0.131108     0.253527  0.52    0.6210  -0.468388   0.730603
    x: c         0.0718425    0.253527  0.28    0.7851  -0.527653   0.671338
    ────────────────────────────────────────────────────────────────────────
    
    julia> levels(df.x)
    3-element Vector{String}:
     "a"
     "b"
     "c"
    
    julia> levels!(df.x, ["c", "b", "a"])
    10-element CategoricalArray{String,1,UInt32}:
     "b"
     "b"
     "a"
     "a"
     "c"
     "a"
     "c"
     "c"
     "a"
     "b"
    
    julia> lm(@formula(y~x), df)
    StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
    
    y ~ 1 + x
    
    Coefficients:
    ───────────────────────────────────────────────────────────────────────────
                      Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
    ───────────────────────────────────────────────────────────────────────────
    (Intercept)   0.354119     0.191648   1.85    0.1071  -0.0990568   0.807295
    x: b          0.0592652    0.271031   0.22    0.8331  -0.581622    0.700153
    x: a         -0.0718425    0.253527  -0.28    0.7851  -0.671338    0.527653
    ───────────────────────────────────────────────────────────────────────────
    

    More advanced strategies are described here: https://juliastats.org/StatsModels.jl/stable/contrasts/.