Search code examples
rdplyrtidyrmagrittr

What's a better way to summarize this data frame?


I have a data frame like this:

  message.id sender recipient
1          1      A         B
2          1      A         C
3          2      A         B
4          3      B         C
5          3      B         D
6          3      B         Q

I would like to summarize it by the counts of values in the sender and recipient columns to get this:

  address messages.sent messages.received
1       A             3                 0
2       B             3                 2
3       C             0                 2
4       D             0                 1
5       Q             0                 1

I have working code, but it's messy, and I'm hoping there's a way to do this all in one magrittr chain instead of what I have below:

df <- data.frame(message.id = c(1,1,2,3,3,3),
                 sender = c("A","A","A","B","B","B"),
                 recipient = c("B","C","B","C","D","Q"))
sent <- df %>% 
  group_by(sender) %>%
  summarise(messages.sent = n()) %>%
  mutate(address = sender) %>%
  select(address, messages.sent)

received <- df %>% 
  group_by(recipient) %>%
  summarise(messages.received = n()) %>%
  mutate(address = recipient) %>%
  select(address, messages.received)

df_summary <- merge(sent, received, all = TRUE) %>%
  replace(is.na(.), 0)

Solution

  • We can use melt/dcast

    library(reshape2)
    dcast(melt(df1, id.var='message.id'), value~variable, 
                     value.var='message.id', length)
    

    Or using a wrapper recast

    recast(df1, id.var='message.id', value~variable, length)
    #    value sender recipient
    #1     A      3         0
    #2     B      3         2
    #3     C      0         2
    #4     D      0         1
    #5     Q      0         1
    

    If we need to use dplyr/tidyr

    library(dplyr)
    library(tidyr)
    gather(df1, messages, address, 2:3) %>%
              group_by(messages, address) %>%
              summarise(n=n()) %>% 
              spread(messages, n, fill=0)
    #     address sender recipient
    #     (chr)  (dbl)     (dbl)
    #1       A      3         0
    #2       B      3         2
    #3       C      0         2
    #4       D      0         1
    #5       Q      0         1