Search code examples
rsumsummarize

Sum is not computed over groups (always gives the absolute total)


I'm creating some summary tables and I'm having a hard time with simple sums...

While the count of records is correct, the variables with sums always compute the same value for all groups.

This is the code:

SummarybyCallContext <- PSTNRecords %>% 
                            group_by (PSTNRecords$destinationContext) %>% 
                                summarize(
                                  Calls = n(), 
                                  Minutes = sum(PSTNRecords$durationMinutes),
                                  Charges = sum(PSTNRecords$charge), 
                                  Fees = sum(PSTNRecords$connectionCharge)
                                )
                                  
SummarybyCallContext

And this is the result:

Result

Minutes and Charges should be different for each group (Fees is always zero, but I need to display it anyway in the table).

Setting na.rm to TRUE or FALSE doesn't seem to change the result.

What am I doing wrong?

Thanks in advance!

~Alienvolm


Solution

  • (Almost) Never use PSTNRecords$ within dplyr verb functions in a pipeline starting from PSTNRecords. Why? With the $-indexing, every reference is to the original data, before any grouping or filtering or adding/changing columns or rearranging is done. Without the $-referencing, it is using the columns as they appear at that point in the pipeline.

    SummarybyCallContext <- PSTNRecords %>% 
                                group_by (destinationContext) %>% 
                                    summarize(
                                      Calls = n(), 
                                      Minutes = sum(durationMinutes),
                                      Charges = sum(charge), 
                                      Fees = sum(connectionCharge)
                                    )
    

    There are exceptions to this, but they are rare and, for the vast majority of new dplyr users, generally done better via other mechanisms.

    Demonstration:

    dat <- data.frame(x=1:5)
    dat %>%
      filter(dat$x > 2) %>%      # this still works okay, since `dat` and "data now" are same
      summarize(x2 = dat$x[1])   # however, `dat` has 5 rows but data in pipe only has 3 rows
    #   x2
    # 1  1
    dat %>%
      filter(x > 2) %>%
      summarize(x2 = x[1])
    #   x2
    # 1  3