r, variables, output, mean, summary

What is the difference between by and summary?


Maybe someone can answer my question: what is the difference between the following two calls? I am interested in the mean, but the two calls give me different numbers.

> by(wcomp$numbf.y, wcomp$partw2, summary, na.rm = TRUE)

Mean 2.473

summary(wcomp$numbf.y, wcomp$partw2, na.rm = TRUE)

Mean 2.573

Thanks for your help


Solution

  • Without knowing your data: by applies a function (summary) to a vector (wcomp$numbf.y) by a group (wcomp$partw2).

    summary, on the other hand, summarises all of your data at once (and, for a numeric vector, ignores the second argument).

    See also this MWE (I've used the mtcars dataset and set some values to NA):

    
    df <- mtcars
    df[c(1, 5), c("cyl", "mpg")] <- NA
    head(df)
    #>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
    #> Mazda RX4           NA  NA  160 110 3.90 2.620 16.46  0  1    4    4
    #> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    #> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    #> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    #> Hornet Sportabout   NA  NA  360 175 3.15 3.440 17.02  0  0    3    2
    #> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
    
    by(df$mpg, df$cyl, summary)
    #> df$cyl: 4
    #>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    #>   21.40   22.80   26.00   26.66   30.40   33.90 
    #> ------------------------------------------------------------ 
    #> df$cyl: 6
    #>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    #>   17.80   18.38   19.45   19.53   20.68   21.40 
    #> ------------------------------------------------------------ 
    #> df$cyl: 8
    #>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    #>   10.40   14.30   15.20   14.82   15.80   19.20
    
    summary(df$mpg, df$cyl)
    #>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    #>   10.40   15.28   19.20   20.11   22.80   33.90       2
    summary(df$mpg)
    #>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    #>   10.40   15.28   19.20   20.11   22.80   33.90       2
    summary(df$cyl)
    #>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    #>   4.000   4.000   6.000   6.133   8.000   8.000       2
    

    Created on 2020-10-07 by the reprex package (v0.3.0)

    We see that the mean values all differ, because we are calculating different means: the summary call computes one mean over all observations, whereas the by call computes a separate summary for each group (cyl). Rows where cyl is NA belong to no group, so by leaves them out.
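
    The overall mean and the group means are still related: the overall mean is the average of the group means, weighted by group size. Here is a minimal sketch on the same modified mtcars data (the names grp_mean and grp_n are only illustrative). It works out in this example because the two rows with a missing cyl also have a missing mpg, so both approaches use the same 30 observations:

    ```r
    df <- mtcars
    df[c(1, 5), c("cyl", "mpg")] <- NA

    # Per-group mean and number of non-missing mpg values per group
    grp_mean <- tapply(df$mpg, df$cyl, mean, na.rm = TRUE)
    grp_n    <- tapply(!is.na(df$mpg), df$cyl, sum)

    # Weighting the group means by group size recovers the overall mean
    weighted.mean(grp_mean, grp_n)
    mean(df$mpg, na.rm = TRUE)
    ```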

    We also see that the second argument to summary() is ignored.
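
    A quick way to check this yourself, as a sketch on the same modified mtcars data: for a numeric vector, anything after the first argument ends up in the ... of summary() and is not used. That includes na.rm = TRUE, since summary() drops missing values on its own and reports them under NA's. The names s_plain and s_extra are just for illustration:

    ```r
    df <- mtcars
    df[c(1, 5), c("cyl", "mpg")] <- NA

    s_plain <- summary(df$mpg)
    s_extra <- summary(df$mpg, df$cyl, na.rm = TRUE)

    # Should be TRUE: the extra arguments do not change the result
    identical(s_plain, s_extra)
    ```

    This holds for numeric vectors. Other summary() methods do use further arguments, for example maxsum in summary.factor(), so it is safer not to rely on extra arguments being dropped.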

    Does that answer your question?

    If you are only interested in the mean, try:

    mean(df$mpg, na.rm = TRUE) #< na.rm needed here!
    #> [1] 20.10667
    
    by(df$mpg, df$cyl, mean)
    #> df$cyl: 4
    #> [1] 26.66364
    #> ------------------------------------------------------ 
    #> df$cyl: 6
    #> [1] 19.53333
    #> ------------------------------------------------------ 
    #> df$cyl: 8
    #> [1] 14.82308
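
    One caveat, sketched with a hypothetical variant df2 in which an mpg value is missing but its group is known: by() then hands the NA to mean(), so that group's mean is NA unless na.rm = TRUE is passed along. by() forwards extra arguments to the function it applies. aggregate() with a formula is a base R alternative that drops incomplete rows by default:

    ```r
    df2 <- mtcars
    df2$mpg[1] <- NA  # Mazda RX4: mpg missing, cyl (6) still known

    m_plain <- by(df2$mpg, df2$cyl, mean)                # group 6 is NA
    m_narm  <- by(df2$mpg, df2$cyl, mean, na.rm = TRUE)  # na.rm is forwarded to mean()

    # aggregate() with a formula omits rows with missing values (na.action = na.omit)
    m_agg <- aggregate(mpg ~ cyl, data = df2, FUN = mean)

    m_plain
    m_narm
    m_agg
    ```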