Maybe someone can answer my question. What is the difference between the following two calls? I am interested in the mean, but I get different numbers.
> by(wcomp$numbf.y, wcomp$partw2, summary, na.rm = TRUE)
Mean 2.473
summary(wcomp$numbf.y, wcomp$partw2, na.rm = TRUE)
Mean 2.573
Thanks for your help
Without knowing your data: by applies a function (summary) to a vector (wcomp$numbf.y) by a group (wcomp$partw2).
summary, on the other hand, summarises the whole vector at once; for a numeric vector the second argument is simply ignored.
See also this MWE (I've used the mtcars dataset and set some values to NA):
df <- mtcars
df[c(1, 5), c("cyl", "mpg")] <- NA
head(df)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 NA NA 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout NA NA 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
by(df$mpg, df$cyl, summary)
#> df$cyl: 4
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 21.40 22.80 26.00 26.66 30.40 33.90
#> ------------------------------------------------------------
#> df$cyl: 6
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 17.80 18.38 19.45 19.53 20.68 21.40
#> ------------------------------------------------------------
#> df$cyl: 8
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.40 14.30 15.20 14.82 15.80 19.20
summary(df$mpg, df$cyl)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 10.40 15.28 19.20 20.11 22.80 33.90 2
summary(df$mpg)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 10.40 15.28 19.20 20.11 22.80 33.90 2
summary(df$cyl)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 4.000 4.000 6.000 6.133 8.000 8.000 2
Created on 2020-10-07 by the reprex package (v0.3.0)
The means all differ because we are calculating different things: the summary call computes one mean over all observations, whereas the by call computes the summary separately for each group (cyl).
We also see that the second argument to summary() is ignored.
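To make both points concrete, here is a small sketch that reuses the MWE data. The variable names (with_group, without_group, group_means, group_n, overall) are my own and purely illustrative. It checks that summary() returns the same values with and without the second argument, and that the overall mean is just the count-weighted average of the per-group means. The second check only holds here because the rows with a missing cyl are also missing mpg; with your data that may not be the case.

```r
# Rebuild the example data from the MWE above.
df <- mtcars
df[c(1, 5), c("cyl", "mpg")] <- NA

# 1) For a numeric vector, summary() does not use the second argument.
with_group    <- summary(df$mpg, df$cyl)
without_group <- summary(df$mpg)
print(all.equal(unclass(with_group), unclass(without_group)))

# 2) Per-group means and per-group counts of non-missing mpg values.
group_means <- tapply(df$mpg, df$cyl, mean, na.rm = TRUE)
group_n     <- tapply(!is.na(df$mpg), df$cyl, sum)

# The count-weighted average of the group means matches the overall mean
# here, because the rows with a missing cyl are also missing mpg.
overall <- mean(df$mpg, na.rm = TRUE)
print(all.equal(weighted.mean(group_means, group_n), overall))
```

So a single group mean from by() and the overall mean from summary() will usually differ, even though both come from the same column.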
Does that answer your question?
If you are only interested in the mean, try
mean(df$mpg, na.rm = TRUE) #< na.rm needed here!
#> [1] 20.10667
by(df$mpg, df$cyl, mean)
#> df$cyl: 4
#> [1] 26.66364
#> ------------------------------------------------------
#> df$cyl: 6
#> [1] 19.53333
#> ------------------------------------------------------
#> df$cyl: 8
#> [1] 14.82308
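One caveat: the by() call above works without na.rm only because the missing mpg values sit in rows where cyl is missing too, and those rows are dropped from the grouping. As a hedged sketch, assume a hypothetical variant (df2, my own name) where mpg is missing but cyl is known. by() forwards extra arguments to the function, so na.rm = TRUE can be passed along just as in your original call. tapply() is shown as an alternative that returns a plain named array.

```r
# Hypothetical variant of the example: mpg is missing but cyl is known.
df2 <- mtcars
df2$mpg[2] <- NA  # Mazda RX4 Wag, a 6-cylinder car

# Without na.rm, the NA propagates into the mean of the cyl == 6 group.
print(by(df2$mpg, df2$cyl, mean))

# Extra arguments to by() are forwarded to the function, so na.rm works.
print(by(df2$mpg, df2$cyl, mean, na.rm = TRUE))

# tapply() computes the same group means and returns a named array.
means_with_na <- tapply(df2$mpg, df2$cyl, mean)
means_na_rm   <- tapply(df2$mpg, df2$cyl, mean, na.rm = TRUE)
print(means_with_na)
print(means_na_rm)
```

If the group for cyl 6 shows NA in the first call but a number in the second, that is the same effect that na.rm = TRUE guards against in your by() call.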