Search code examples
julialinear-regressiongroup

Linear regression per group in Julia


To do a linear regression in Julia we can use the function lm like this:

using DataFrames
using GLM

df = DataFrame(x=[2,3,2,1,3,5,7,4,2],
                      y=[2,3,5,1,5,6,4,2,3])
9×2 DataFrame
 Row │ x      y     
     │ Int64  Int64 
─────┼──────────────
   1 │     2      2
   2 │     3      3
   3 │     2      5
   4 │     1      1
   5 │     3      5
   6 │     5      6
   7 │     7      4
   8 │     4      2
   9 │     2      3

lm(@formula(y~x), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
───────────────────────────────────────────────────────────────────────
                Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  2.14516     1.11203   1.93    0.0951  -0.484383    4.77471
x            0.403226    0.303282  1.33    0.2254  -0.313923    1.12037
───────────────────────────────────────────────────────────────────────

But I was wondering how to do a linear regression per group in Julia. Here is some reproducible data:

df = DataFrame(group = ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
                               'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
                      x=[2,3,2,1,3,5,7,4,2,3,4,5,2,6,3,1,6,1],
                      y=[2,3,5,1,5,6,4,2,3,3,2,4,7,1,8,4,3,1])
18×3 DataFrame
 Row │ group  x      y     
     │ Char   Int64  Int64 
─────┼─────────────────────
   1 │ A          2      2
   2 │ A          3      3
   3 │ A          2      5
   4 │ A          1      1
   5 │ A          3      5
   6 │ A          5      6
   7 │ A          7      4
   8 │ A          4      2
  ⋮  │   ⋮      ⋮      ⋮
  12 │ B          5      4
  13 │ B          2      7
  14 │ B          6      1
  15 │ B          3      8
  16 │ B          1      4
  17 │ B          6      3
  18 │ B          1      1
             3 rows omitted

So I was wondering if anyone knows how to perform a linear regression per group (in this case for group A and B in df) and get the statistical coefficients like p-value and R Square per group in Julia?


Solution

  • UPDATE: In light of @matnor's comment about trouble in interpreting results of regression with baseline in answer, here is a better formula which gives clearer grouped results:

    lm(@formula(y~0 + group + x & group), df)
    

    With this regression the table is mostly self-explanatory. Note covariance still needs interpretation (but depending on context may be more applicable).

    ORIGINAL ANSWER: It can be as simple as:

    lm(@formula(y~1 + group + x*group), df)
    

    GLM fits group as a categorical variable, adding a dummy coefficient (maybe today the PC crowd will change this name) for each group. The interaction term x*group adds another set of dummy coefficients. The first set, represents the intercept of each group, and the second represents the slope. Here are the results:

    StatsModels.TableRegressionModel{...}
    
    y ~ 1 + x + group + x & group
    
    Coefficients:
    ──────────────────────────────────────────────────────
                      Coef.  Std. Error      t  Pr(>|t|)
    ──────────────────────────────────────────────────────
    (Intercept)    2.14516     1.47762    1.45    0.1686
    x              0.403226    0.402988   1.00    0.3340
    group: B       2.62322     2.10649    1.25    0.2335
    x & group: B  -0.723079    0.557198  -1.30    0.2154
    ──────────────────────────────────────────────────────
    

    Note that group A doesn't appear because it is the baseline group (and corresponds to first intercept/slope pair).

    If you look at the numbers, for example, for group B, you can see 2.62322 and -0.723079 which need to be added to the baseline to get slope/intercept of group:

    julia> # 4.76838, -0.319853 are group B slope/intercept
    
    julia> 2.14516 + 2.62322 ≈ 4.76838 # intercept
    true
    
    julia> 0.403226 + -0.723079 ≈ -0.319853 # slope
    true
    

    There are some benefits in terms of efficiency to this method, as well as added flexibility (GLM has more features).