I'm creating some summary tables and I'm having a hard time with simple sums...
While the count of records is correct, the variables with sums always compute the same value for all groups.
This is the code:
SummarybyCallContext <- PSTNRecords %>%
  group_by (PSTNRecords$destinationContext) %>%
  summarize(
    Calls = n(),
    Minutes = sum(PSTNRecords$durationMinutes),
    Charges = sum(PSTNRecords$charge),
    Fees = sum(PSTNRecords$connectionCharge)
  )
SummarybyCallContext
And this is the result:
Minutes and Charges should be different for each group (Fees is always zero, but I need to display it anyway in the table).
Setting na.rm to TRUE or FALSE doesn't seem to change the result.
What am I doing wrong?
Thanks in advance!
~Alienvolm
(Almost) never use PSTNRecords$ within dplyr verb functions in a pipeline starting from PSTNRecords. Why? With $-indexing, every reference is to the original data, before any grouping, filtering, adding or changing of columns, or rearranging is done. That is why sum(PSTNRecords$durationMinutes) returns the total over the whole table for every group. Without the $-referencing, dplyr uses the columns as they appear at that point in the pipeline, so each sum is computed within its own group.
SummarybyCallContext <- PSTNRecords %>%
  group_by(destinationContext) %>%
  summarize(
    Calls = n(),
    Minutes = sum(durationMinutes),
    Charges = sum(charge),
    Fees = sum(connectionCharge)
  )
There are exceptions to this, but they are rare and, for the vast majority of new dplyr users, generally done better via other mechanisms.
Demonstration:
library(dplyr)

dat <- data.frame(x = 1:5)

dat %>%
  filter(dat$x > 2) %>%       # this still works okay, since `dat` and "data now" are the same here
  summarize(x2 = dat$x[1])    # however, `dat` has 5 rows but the data in the pipe only has 3 rows
#   x2
# 1  1

dat %>%
  filter(x > 2) %>%
  summarize(x2 = x[1])        # x now refers to the filtered data
#   x2
# 1  3
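To see the same effect on a grouped sum like the one in the question, here is a minimal sketch on made-up data. The `calls` data frame and its values are hypothetical stand-ins for PSTNRecords, not the asker's actual data; only the column names are borrowed from the question.

```r
library(dplyr)

# Hypothetical toy data standing in for PSTNRecords (values invented for illustration)
calls <- data.frame(
  destinationContext = c("domestic", "domestic", "international"),
  durationMinutes    = c(10, 20, 5)
)

# With `$`: sum() sees the whole original column, so every group gets the grand total
wrong <- calls %>%
  group_by(destinationContext) %>%
  summarize(Calls = n(), Minutes = sum(calls$durationMinutes))

# Without `$`: sum() sees only the rows of the current group
right <- calls %>%
  group_by(destinationContext) %>%
  summarize(Calls = n(), Minutes = sum(durationMinutes))

wrong$Minutes   # 35 35
right$Minutes   # 30  5
```

Note that Calls is correct in both versions because n() never touches the original data frame; only the expressions written with `$` bypass the grouping.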