Search code examples
rdplyrtsibble

dplyr group_by retaining extra columns after summarise


I am at a total loss for this one - I am playing with the "pedestrian" dataset from tsibble. I want to get total counts for each month/year. I started by adding a month_year column, then summarise the data with sum, like so:

library("tidyverse")
library("tsibble")

df1 <- pedestrian
df1$month_year <- format(as.Date(df1$Date), "%Y-%m")

count_all <- df1 %>%  
  dplyr::group_by(month_year) %>% 
  dplyr::summarise(total = sum(Count))

A summary of count_all looks like this:

  month_year          Date_Time                         total      
 Length:17542       Min.   :2015-01-01 00:00:00.0   Min.   :   12  
 Class :character   1st Qu.:2015-07-02 17:15:00.0   1st Qu.:  349  
 Mode  :character   Median :2016-01-01 11:30:00.0   Median : 2090  
                    Mean   :2016-01-01 11:44:40.2   Mean   : 2593  
                    3rd Qu.:2016-07-02 04:45:00.0   3rd Qu.: 4455  
                    Max.   :2016-12-31 23:00:00.0   Max.   :15990

Why is Date_Time being retained? And how can I prevent it form impacting the summary (as in prevent it from giving me 17,542 rows instead of the expected 24). If I remove the column before the summary like so:

df1$Date_Time <- NULL

Then it works fine, and a summary of the result looks like this:

  month_year            total        
 Length:24          Min.   :1148276  
 Class :character   1st Qu.:1756898  
 Mode  :character   Median :1927154  
                    Mean   :1895161  
                    3rd Qu.:2066043  
                    Max.   :2393675  

This solution is fine, but I would like to know what the cause of the issue is so that I can avoid it in future (it was easy to catch the problem this time, but may not always be so straight forward).

Thanks in advance for the help!


Solution

  • The dataset pedestrian is a tsibble with sensor as a key and Date_Time as the index. Any operation you do on the tsibble will retain the index. You can remove the index by converting back to a tibble.

    pedestrian %>%
      as_tibble() %>% 
      mutate(ym = yearmonth(Date)) %>% 
      dplyr::group_by(ym) %>% 
      dplyr::summarise(total = sum(Count))