Search code examples
rdataframeloopsstandard-deviation

how to get standard deviation of an aggregate of all values of a dataframe (or matrix), grouped by one variable


I have a data frame with 13 columns. the 13th column shows the group number each row belongs to. I want to take the standard deviation of all values in all rows of columns 3 to 12, for rows of each group.

 for(i in 1: groupnumber) {  
sd.vect[i] <- sd(as.vector(df[df$group==i,][,-c(1,2,13)]))}

I get error

Error in is.data.frame(x) : 'list' object cannot be coerced to type 'double'

how can I get sd of all values in each group?


Solution

  • You can use dplyr. You can group_by(grouping_column), then summarise() with the sd() function:

    library(dplyr)
    
    data_frame%>%>group_by(thirteenth_column)%>%summarise(across(-c(1,2), sd))
    

    As an example with mtcars:

    library(dplyr)
    
    mtcars%>%group_by(cyl)%>%summarise(across(everything(), sd))
    
    # A tibble: 3 x 11
        cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1     4  4.51  26.9  20.9 0.365 0.570  1.68 0.302 0.467 0.539 0.522
    2     6  1.45  41.6  24.3 0.476 0.356  1.71 0.535 0.535 0.690 1.81 
    3     8  2.56  67.8  51.0 0.372 0.759  1.20 0     0.363 0.726 1.56 
    

    That said, with some corrections, your for loop works: You just have to make sure to return a dataframe, as your desired outcome is actually a two-dimension object. See an example with the iris dataset:

    sd.df<-data.frame()
    for(i in unique(iris$Species)) {
            for (j in 1:ncol(iris[-5])){
            sd.df[i,j] <- sd(iris[iris$Species==i,j])
            }
    }
    sd.df
    
                      V1        V2        V3        V4
    setosa     0.3524897 0.3790644 0.1736640 0.1053856
    versicolor 0.5161711 0.3137983 0.4699110 0.1977527
    virginica  0.6358796 0.3224966 0.5518947 0.2746501
    

    update
    I understand from your comments you may want something quite odd, which would be to group your data by the grouping variable, than get the standard deviation for an aggregate of all values in each sub-dataframe. You probably would be better off working with matrices here. For that, you may neet to group your data by unique() values in the grouping column, then call sd() on the rest of the dataframe (all values), which can be done if you coerce the dataframe into a matrix:

    library(dplyr)
    library(purrr)
    
    map_dbl(unique(mtcars$cyl), ~as.matrix(mtcars%>%
                                               filter(cyl==.x)%>%
                                               select(-cyl))%>%
                sd())%>%
            set_names(., unique(mtcars$cyl))
    
            6         4         8 
     62.47655  37.54494 118.18945 
    

    With your data:

    map(unique(df[[13]]), ~as.matrix(df%>%
                                       filter(df[[13]]==.x)%>%
                                       select(-c(1,2,13)))%>%
            sd()%>%
            set_names(., unique(df[[13]]))
    

    And a much simpler answer with base subsetting and split():

    map_dbl(split(mtcars[-c(1,2, 10)], mtcars[10]), ~sd(as.matrix(.x)))
    
            3         4         5 
    119.47824  47.97490  98.71733