Search code examples
rdataframesummary

summarizing data frame by columns in R


I have this data frame df:

df <- structure(list(App = structure(c(4L, 4L, 3L, 3L, 2L, 2L, 1L), .Label = c("DB", 
"End", "Mid", "Web"), class = "factor"), Server = structure(c(5L, 
6L, 1L, 2L, 3L, 4L, 7L), .Label = c("GServer101", "Hserver103", 
"JServer100", "Kserver200", "Server101", "Server102", "Xdb101"
), class = "factor"), Process1 = c(1L, 5L, 1L, 1L, 1L, 1L, 1L
), Process2 = c(1L, 1L, 1L, 4L, 1L, 1L, 1L), Process3 = c(NA, 
NA, NA, NA, NA, NA, NA), Process4 = c(NA, NA, NA, NA, NA, NA, 
NA), Process5 = c(NA, NA, NA, 1L, 1L, 1L, 1L)), .Names = c("App", 
"Server", "Process1", "Process2", "Process3", "Process4", "Process5"
), class = "data.frame", row.names = c(NA, -7L))

I would like to be able to summarize df data frame and count and place process by columns as below. I need to know how many process each app has group by column name. How would I do this in R?

end <- structure(list(App = structure(c(4L, 3L, 2L, 1L), .Label = c("DB", 
"End", "Mid", "Web"), class = "factor"), Process1 = c(6L, 2L, 
2L, 1L), Process2 = c(2L, 5L, 2L, 1L), Process3 = c(0L, 0L, 0L, 
0L), Process4 = c(0L, 0L, 0L, 0L), Process5 = c(0L, 1L, 2L, 1L
)), .Names = c("App", "Process1", "Process2", "Process3", "Process4", 
"Process5"), class = "data.frame", row.names = c(NA, -4L))

Solution

  • You can use dplyr:

    library(dplyr)
    df %>% 
          group_by(App) %>% 
          summarize_at(vars(starts_with("Process")), funs(sum(., na.rm=TRUE)))
    
    # A tibble: 4 × 6
    #     App Process1 Process2 Process3 Process4 Process5
    #  <fctr>    <int>    <int>    <int>    <int>    <int>
    #1     DB        1        1        0        0        1
    #2    End        2        2        0        0        2
    #3    Mid        2        5        0        0        1
    #4    Web        6        2        0        0        0
    

    Or if column positions are preferred, the positions can be passed to .cols parameter:

    df %>% 
           group_by(App) %>% 
           summarize_at(.cols=3:7, funs(sum(., na.rm=TRUE)))
    
    # A tibble: 4 × 6
    #     App Process1 Process2 Process3 Process4 Process5
    #  <fctr>    <int>    <int>    <int>    <int>    <int>
    #1     DB        1        1        0        0        1
    #2    End        2        2        0        0        2
    #3    Mid        2        5        0        0        1
    #4    Web        6        2        0        0        0