To do a linear regression in Julia, we can use the lm function from the GLM package, like this:
using DataFrames
using GLM
df = DataFrame(x=[2,3,2,1,3,5,7,4,2],
y=[2,3,5,1,5,6,4,2,3])
9×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     3      3
   3 │     2      5
   4 │     1      1
   5 │     3      5
   6 │     5      6
   7 │     7      4
   8 │     4      2
   9 │     2      3
lm(@formula(y~x), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
y ~ 1 + x
Coefficients:
───────────────────────────────────────────────────────────────────────
                Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  2.14516     1.11203   1.93    0.0951  -0.484383    4.77471
x            0.403226    0.303282  1.33    0.2254  -0.313923    1.12037
───────────────────────────────────────────────────────────────────────
But I was wondering how to do a linear regression per group in Julia. Here is some reproducible data:
df = DataFrame(group = ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
x=[2,3,2,1,3,5,7,4,2,3,4,5,2,6,3,1,6,1],
y=[2,3,5,1,5,6,4,2,3,3,2,4,7,1,8,4,3,1])
18×3 DataFrame
 Row │ group  x      y
     │ Char   Int64  Int64
─────┼─────────────────────
   1 │ A          2      2
   2 │ A          3      3
   3 │ A          2      5
   4 │ A          1      1
   5 │ A          3      5
   6 │ A          5      6
   7 │ A          7      4
   8 │ A          4      2
  ⋮  │   ⋮      ⋮      ⋮
  12 │ B          5      4
  13 │ B          2      7
  14 │ B          6      1
  15 │ B          3      8
  16 │ B          1      4
  17 │ B          6      3
  18 │ B          1      1
             3 rows omitted
So I was wondering if anyone knows how to perform a linear regression per group in Julia (in this case, for groups A and B in df) and get statistical measures such as the p-value and R-squared per group?
UPDATE: In light of @matnor's comment about the difficulty of interpreting baseline-coded regression results in the original answer, here is a better formula, which gives clearer grouped results:
lm(@formula(y~0 + group + x & group), df)
With this regression the coefficient table is mostly self-explanatory: each group gets its own intercept and slope directly, with no baseline arithmetic. Note that the covariance structure still needs interpretation (but depending on context this parameterization may be more applicable).
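As a sketch (assuming the same grouped df as above), you can read each group's intercept and slope straight off the fitted model by pairing coefnames with coef:

```julia
using DataFrames, GLM

df = DataFrame(group = repeat(['A', 'B'], inner = 9),
               x = [2,3,2,1,3,5,7,4,2,3,4,5,2,6,3,1,6,1],
               y = [2,3,5,1,5,6,4,2,3,3,2,4,7,1,8,4,3,1])

# No intercept term, so each group gets its own intercept coefficient,
# and x & group gives each group its own slope coefficient.
m = lm(@formula(y ~ 0 + group + x & group), df)

# Coefficient names come out like "group: A" and "x & group: B",
# so no baseline arithmetic is needed to recover per-group values.
for (name, value) in zip(coefnames(m), coef(m))
    println(name, " => ", value)
end
```

If the arithmetic later in this answer is right, this should report intercepts of roughly 2.14516 (A) and 4.76838 (B), and slopes of roughly 0.403226 (A) and -0.319853 (B).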
ORIGINAL ANSWER: It can be as simple as:
lm(@formula(y~1 + group + x*group), df)
GLM fits group as a categorical variable, adding a dummy-coded coefficient for each non-baseline group, and the interaction term x*group adds another set of coefficients. The first set represents the intercepts of the groups and the second their slopes, expressed as offsets from the baseline group. Here are the results:
StatsModels.TableRegressionModel{...}
y ~ 1 + x + group + x & group
Coefficients:
──────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)
──────────────────────────────────────────────────────
(Intercept)    2.14516     1.47762    1.45    0.1686
x              0.403226    0.402988   1.00    0.3340
group: B       2.62322     2.10649    1.25    0.2335
x & group: B  -0.723079    0.557198  -1.30    0.2154
──────────────────────────────────────────────────────
Note that group A doesn't appear because it is the baseline group: it corresponds to the first intercept/slope pair, (Intercept) and x.
If you look at the numbers for group B, for example, you can see 2.62322 and -0.723079, which need to be added to the baseline to get the intercept and slope of the group:
julia> # 4.76838, -0.319853 are group B's intercept and slope
julia> 2.14516 + 2.62322 ≈ 4.76838 # intercept
true
julia> 0.403226 + -0.723079 ≈ -0.319853 # slope
true
There are some efficiency benefits to this single-model approach, as well as added flexibility (GLM has more features).
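That said, the question also asks for R-squared per group, which a single pooled model doesn't report. One simple alternative, sketched here under the same df as above, is to fit one regression per group with groupby; r2 and coeftable (which includes per-coefficient p-values) are both available from GLM:

```julia
using DataFrames, GLM

df = DataFrame(group = repeat(['A', 'B'], inner = 9),
               x = [2,3,2,1,3,5,7,4,2,3,4,5,2,6,3,1,6,1],
               y = [2,3,5,1,5,6,4,2,3,3,2,4,7,1,8,4,3,1])

# Fit y ~ x separately within each group; each model then has its
# own R-squared and its own coefficient table with p-values.
for sub in groupby(df, :group)
    m = lm(@formula(y ~ x), sub)
    println("group ", first(sub.group), ": R² = ", r2(m))
    display(coeftable(m))
end
```

Since group A's data here is identical to the 9-row df at the top of the question, its fit should reproduce that first regression (slope 0.403226, p = 0.2254).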