Search code examples
rdplyraveragetidyrnested-datalist

Compute average of grouped data.frame using dplyr and tidyr


I'm just learning R and trying to find ways to modify my grouped data.frame in order to get a mean of a variable value (x+y/2), and standard deviation (sd) sqrt((x^2+y^2)/2) of cohesive observations. Other (equal) variables (sequence, value1) should not change.

I used subset() and rowMeans(), but I wonder if there is a nicer way using dplyr and tidyr (probably using a nested dataframe?)

My test data.frame looks like:

id      location    value  sd    sequence value1
"anon1" "nose"      5      0.2    "a"      1
"anon2" "body"      4      0.4    "a"      2
"anon3" "left_arm"  3      0.3    "a"      3
"anon3" "right_arm" 5      0.6    "a"      3
"anon4" "head"      4      0.3    "a"      4
"anon5" "left_leg"  2      0.2    "a"      5
"anon5" "right_leg" 1      0.1    "a"      5

dput output of my test data.frame:

myData <- structure(list(ï..id = structure(c(1L, 2L, 3L, 3L, 4L, 5L, 5L
), .Label = c("anon1", "anon2", "anon3", "anon4", "anon5"), class = "factor"), 
    location = structure(c(5L, 1L, 3L, 6L, 2L, 4L, 7L), .Label = c("body", 
    "head", "left_arm", "left_leg", "nose", "right_arm", "right_leg"
    ), class = "factor"), value = c(5L, 4L, 3L, 5L, 4L, 2L, 1L
    ), sd = c(0.2, 0.4, 0.3, 0.6, 0.3, 0.2, 0.1), sequence = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L), .Label = "a", class = "factor"), 
    value1 = c(1L, 2L, 3L, 3L, 4L, 5L, 5L)), .Names = c("ï..id", 
"location", "value", "sd", "sequence", "value1"), class = "data.frame", row.names = c(NA, 
-7L))

how it should look:

id      location value sd   sequence value1
"anon1" "nose"   5     0.2  "a"      1 
"anon2" "body"   4     0.4  "a"      2
"anon3" "arm"    4     0.47 "a"      3
"anon4" "head"   4     0.3  "a"      4
"anon5" "leg"    1.5   0.15 "a"      5

Solution

  • dplyr's group_by and summarise will help, with some support from gsub for the string variable:

    library(dplyr)
    
    myData %>% 
      group_by(id) %>% 
      summarise(
        location = gsub(".*_", "", location[1]),
        value = mean(value),
        sd = mean(sd),
        sequence = sequence[1],
        value1 = value1[1]
      )
    #> # A tibble: 5 × 6
    #>       id location value    sd sequence value1
    #>   <fctr>    <chr> <dbl> <dbl>   <fctr>  <int>
    #> 1  anon1     nose   5.0  0.20        a      1
    #> 2  anon2     body   4.0  0.40        a      2
    #> 3  anon3      arm   4.0  0.45        a      3
    #> 4  anon4     head   4.0  0.30        a      4
    #> 5  anon5      leg   1.5  0.15        a      5
    

    Or if id, sequence, and value1 match across all cases:

    myData %>% 
      group_by(id, sequence, value1) %>% 
      summarise(
        location = gsub(".*_", "", location[1]),
        value = mean(value),
        sd = mean(sd))
    #> Source: local data frame [5 x 6]
    #> Groups: id, sequence [?]
    #> 
    #>       id sequence value1 location value    sd
    #>   <fctr>   <fctr>  <int>    <chr> <dbl> <dbl>
    #> 1  anon1        a      1     nose   5.0  0.20
    #> 2  anon2        a      2     body   4.0  0.40
    #> 3  anon3        a      3      arm   4.0  0.45
    #> 4  anon4        a      4     head   4.0  0.30
    #> 5  anon5        a      5      leg   1.5  0.15