Tags: r, data-visualization, data-analysis, rep

How to understand the usage of rep() in this case?


The dataset randomdat contains 299 observations and includes two categorical variables: var9 takes values such as With XYZ and Without XYZ, and var8 takes values such as Group A, Group B, and Group C. var1 is a numeric variable.

Then there's a model:

m7 <- lm(var3~var1+I(var1^2)+I(var1^3)+var9, data=randomdat)

summary(m7) shows that, for any given var1, the fitted value for Without XYZ is 34451.4 lower than for With XYZ.

> summary(m7)

Call:
lm(formula = var3 ~ var1 + I(var1^2) + I(var1^3) + var9, data = randomdat)

Residuals:
    Min      1Q  Median      3Q     Max 
-391506  -75127    4799   77175  323856 

Coefficients:
                     Estimate    Std. Error t value            Pr(>|t|)    
(Intercept)     -162934.42035   18571.30251  -8.773 <0.0000000000000002 ***
var1              10927.87454     741.36511  14.740 <0.0000000000000002 ***
I(var1^2)          -180.82979      10.44006 -17.321 <0.0000000000000002 ***
I(var1^3)             0.99499       0.04223  23.562 <0.0000000000000002 ***
var9Without XYZ  -34451.43378   14570.55030  -2.364              0.0187 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 117500 on 294 degrees of freedom
Multiple R-squared:  0.8642,    Adjusted R-squared:  0.8624 
F-statistic: 467.9 on 4 and 294 DF,  p-value: < 0.00000000000000022

Then there are two sets of predictions:

m7_predictwith <- predict(m7,list(var1=randomdat$var1, var9 = rep("With XYZ",299)))
m7_predictwout <- predict(m7,list(var1=randomdat$var1, var9 = rep("Without XYZ",299)))

If you plot them, you will see that the two lines do not overlap.

ggplot(randomdat, aes(x = var1, y = var3)) + 
    geom_point(aes(colour = var8, shape = var8)) + 
    geom_line(aes(y = m7_predictwith), color = 'red', linetype = 2) + 
    geom_line(aes(y = m7_predictwout), color = 'black', linetype = 3)


Now the question: how should var9 = rep("With XYZ",299) and var9 = rep("Without XYZ",299) be understood here? Don't they mean replacing all the values in var9 with With XYZ or Without XYZ? Since var1 is the same in m7_predictwith and m7_predictwout, shouldn't the two plotted lines be one and the same line? I'm confused about how rep() is being used in this case.


Solution

  • rep() repeats values:

    > rep("With XYZ", 5)
    [1] "With XYZ" "With XYZ" "With XYZ" "With XYZ" "With XYZ"
    

    Here it's being used to create data sets that contain:

    • The observed values of var1
    • A fixed value of var9.

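    var9 is a factor variable, and in the regression the estimated coefficient for its "Without XYZ" level is -34451.43378, with "With XYZ" as the baseline level. The model has no interaction between var9 and var1, so changing var9 only adds a constant to the prediction. If you predict one line with var9 fixed at "With XYZ" and another with var9 fixed at "Without XYZ", the "Without XYZ" line is shifted down by a constant 34451, which gives two parallel lines rather than a single line.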
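
    The constant shift can be checked directly. The following is only a minimal sketch on simulated data, since randomdat itself isn't available: the names dat, fit, pred_with, pred_wout, gap and shift, and the true effects 5 and -20, are made up for illustration and do not come from the question.

```r
# Simulated stand-in for randomdat (hypothetical data, not the asker's).
set.seed(1)
n <- 299
dat <- data.frame(
  var1 = runif(n, 0, 100),
  # Levels are set explicitly so that "With XYZ" is the baseline level.
  var9 = factor(sample(c("With XYZ", "Without XYZ"), n, replace = TRUE),
                levels = c("With XYZ", "Without XYZ"))
)
dat$var3 <- 5 * dat$var1 - 20 * (dat$var9 == "Without XYZ") + rnorm(n, sd = 10)

# Same model structure as m7 in the question.
fit <- lm(var3 ~ var1 + I(var1^2) + I(var1^3) + var9, data = dat)

# Same observed var1 values, but var9 forced to one level with rep().
pred_with <- predict(fit, data.frame(var1 = dat$var1,
                                     var9 = rep("With XYZ", n)))
pred_wout <- predict(fit, data.frame(var1 = dat$var1,
                                     var9 = rep("Without XYZ", n)))

# The vertical gap between the two curves equals the var9 coefficient
# at every value of var1, which is why the lines are parallel.
gap   <- pred_wout - pred_with
shift <- coef(fit)[["var9Without XYZ"]]
stopifnot(isTRUE(all.equal(unname(gap), rep(shift, n))))
```

    If the model included an interaction such as var1:var9, the gap would depend on var1 and the two lines would no longer be parallel.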