There's a dataset randomdat that contains 299 observations. It has two categorical variables: var9, with values like "With XYZ" and "Without XYZ", and var8, with values like "Group A", "Group B" and "Group C". var1 is a numerical variable.
Then there's a model:
m7 <- lm(var3~var1+I(var1^2)+I(var1^3)+var9, data=randomdat)
summary(m7) shows that the estimate for "Without XYZ" is always 34451.4 lower than for "With XYZ":
> summary(m7)

Call:
lm(formula = var3 ~ var1 + I(var1^2) + I(var1^3) + var9, data = randomdat)

Residuals:
    Min      1Q  Median      3Q     Max
-391506  -75127    4799   77175  323856

Coefficients:
                      Estimate    Std. Error t value            Pr(>|t|)
(Intercept)      -162934.42035   18571.30251  -8.773 <0.0000000000000002 ***
var1               10927.87454     741.36511  14.740 <0.0000000000000002 ***
I(var1^2)           -180.82979      10.44006 -17.321 <0.0000000000000002 ***
I(var1^3)              0.99499       0.04223  23.562 <0.0000000000000002 ***
var9Without XYZ   -34451.43378   14570.55030  -2.364              0.0187 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 117500 on 294 degrees of freedom
Multiple R-squared: 0.8642, Adjusted R-squared: 0.8624
F-statistic: 467.9 on 4 and 294 DF, p-value: < 0.00000000000000022
Then there are two sets of predictions:
m7_predictwith <- predict(m7,list(var1=randomdat$var1, var9 = rep("With XYZ",299)))
m7_predictwout <- predict(m7,list(var1=randomdat$var1, var9 = rep("Without XYZ",299)))
If you plot them, you will see that the two lines do not overlap.
library(ggplot2)
ggplot(randomdat, aes(x = var1, y = var3)) +
  geom_point(aes(colour = var8, shape = var8)) +
  # Predicted curves: x = var1 is inherited from the top-level aes()
  geom_line(aes(y = m7_predictwith), colour = "red", linetype = 2) +
  geom_line(aes(y = m7_predictwout), colour = "black", linetype = 3)
Now comes the question: how should I understand var9 = rep("With XYZ",299) or var9 = rep("Without XYZ",299) in this case? Don't they mean replacing all the values in var9 with "With XYZ" or "Without XYZ"? Since var1 is the same in m7_predictwith and m7_predictwout, shouldn't the two plotted lines be one and the same line? I'm very confused about how rep() is used in this case.
rep() repeats values:
> rep("With XYZ", 5)
[1] "With XYZ" "With XYZ" "With XYZ" "With XYZ" "With XYZ"
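Here it's being used to create two data sets for predict(), each containing the observed values of var1 and a single fixed value of var9.

var9 is a factor variable, and in the regression the estimated coefficient for its "Without XYZ" level is -34451.43378. So if you predict one line with var9 fixed at "With XYZ", and then another line with var9 fixed at "Without XYZ", the "Without XYZ" line will be shifted down by a constant value of 34451 at every value of var1, creating parallel lines. The rep() calls do not change randomdat$var9 or the fitted model; they only set the var9 value used for each set of predictions.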
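
The constant shift can be checked with a small sketch on simulated data. Everything here is an assumption made for illustration: the data frame toy, its made-up values, and the names fit, new_with, new_wout and gap are hypothetical stand-ins, not the asker's randomdat or m7. Only the model formula and the rep() pattern are taken from the question.

```r
# Simulated stand-in for randomdat (hypothetical values, not the real data)
set.seed(1)
n <- 299
toy <- data.frame(
  var1 = runif(n, 0, 10),
  var9 = factor(sample(c("With XYZ", "Without XYZ"), n, replace = TRUE))
)
toy$var3 <- 100 + 5 * toy$var1 - 30 * (toy$var9 == "Without XYZ") + rnorm(n)

# Same formula as in the question
fit <- lm(var3 ~ var1 + I(var1^2) + I(var1^3) + var9, data = toy)

# Both data sets share the same var1 values; only the fixed var9 level differs
new_with <- data.frame(var1 = toy$var1, var9 = rep("With XYZ", n))
new_wout <- data.frame(var1 = toy$var1, var9 = rep("Without XYZ", n))

pred_with <- predict(fit, newdata = new_with)
pred_wout <- predict(fit, newdata = new_wout)

# Vertical gap between the two curves at every var1 value
gap <- pred_wout - pred_with
range(gap)                     # both ends should match the coefficient below
coef(fit)["var9Without XYZ"]
```

Because var9 enters the model only as an additive term, with no interaction with var1, the gap should be the same at every value of var1 and equal to the var9Without XYZ coefficient, which is why the two plotted curves never overlap.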