I have a dataframe where each row represents a single occurrence that took place in a city. The dataframe displays the name of the city and the date of the occurrence, like so:
df <- data.frame(city = c("Seattle", "Seattle", "Seattle", "Seattle", "Seattle", "NYC", "NYC", "NYC", "Chicago",
"Chicago", "Chicago", "Chicago", "Chicago"),
date_of_event = c("01/13/2011", "01/17/2011", "03/15/2011", "05/21/2011", "05/23/2011",
"01/20/2011", "01/22/2011", "03/23/2011", "01/18/2011", "02/24/2011",
"02/26/2011", "04/30/2011", "06/18/2011"),
stringsAsFactors = FALSE)
df$date_of_event <- as.Date(df$date_of_event, "%m/%d/%Y")
The above is just an example; my actual data is in a CSV file with thousands of rows, many cities, and many dates. I want to generate a new dataframe with one row for each city and each month/year represented in the dataset, plus a count column showing how many occurrences took place in that city in that month in the original dataframe. The second dataframe would look like this:
df2 <- data.frame(city = c("Seattle", "Seattle", "Seattle", "Seattle", "Seattle", "Seattle", "NYC", "NYC", "NYC", "NYC",
"NYC", "NYC", "Chicago", "Chicago", "Chicago", "Chicago", "Chicago", "Chicago"),
month_year = c("01/01/2011", "02/01/2011", "03/01/2011", "04/01/2011", "05/01/2011", "06/01/2011",
"01/01/2011", "02/01/2011", "03/01/2011", "04/01/2011", "05/01/2011", "06/01/2011",
"01/01/2011", "02/01/2011", "03/01/2011", "04/01/2011", "05/01/2011", "06/01/2011"),
count = c(2, 0, 1, 0, 2, 0, 2, 0, 1, 0, 0, 0, 1, 2, 0, 1, 0, 1),
stringsAsFactors = FALSE)
df2$month_year <- as.Date(df2$month_year, "%m/%d/%Y")
I know that dplyr's count can tally rows and that lubridate can round dates down to the first day of each month, but I've tried and failed to get the grouping and counting right to produce the second dataframe.
You can try this:
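library(tidyverse)
library(lubridate)
# round each date down to the first day of its month
df3 <- df %>% mutate(new_date = floor_date(date_of_event, "month"))
# cross-tabulate city by month (the original date column is dropped);
# any city/month pair with no events gets a Freq of 0
tt <- as.data.frame(table(df3[-2]))
# sort by city (descending), then by month
tt[order(desc(tt$city), tt$new_date),]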
city new_date Freq
Seattle 2011-01-01 2
Seattle 2011-02-01 0
Seattle 2011-03-01 1
Seattle 2011-04-01 0
Seattle 2011-05-01 2
Seattle 2011-06-01 0
NYC 2011-01-01 2
NYC 2011-02-01 0
NYC 2011-03-01 1
NYC 2011-04-01 0
NYC 2011-05-01 0
NYC 2011-06-01 0
Chicago 2011-01-01 1
Chicago 2011-02-01 2
Chicago 2011-03-01 0
Chicago 2011-04-01 1
Chicago 2011-05-01 0
Chicago 2011-06-01 1
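Since the question mentions dplyr's count, here is a minimal sketch of the same result using count together with tidyr's complete. The name monthly is just illustrative, and note that complete() here only fills in months that appear at least once somewhere in the data, so a month with no events in any city would still be missing.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Sample data from the question
df <- data.frame(
  city = c(rep("Seattle", 5), rep("NYC", 3), rep("Chicago", 5)),
  date_of_event = as.Date(
    c("01/13/2011", "01/17/2011", "03/15/2011", "05/21/2011", "05/23/2011",
      "01/20/2011", "01/22/2011", "03/23/2011", "01/18/2011", "02/24/2011",
      "02/26/2011", "04/30/2011", "06/18/2011"),
    "%m/%d/%Y"),
  stringsAsFactors = FALSE
)

monthly <- df %>%
  # first day of the month each event falls in
  mutate(month_year = floor_date(date_of_event, "month")) %>%
  # one row per city/month that actually has events
  count(city, month_year, name = "count") %>%
  # add the missing city/month pairs with a count of zero
  complete(city, month_year, fill = list(count = 0L))
```

Unlike the table() output, month_year stays a Date column here, which matches the df2 layout in the question.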
To include extended periods with zero count, you can try this:
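# assign a name to the output obtained previously
df4 <- tt[order(desc(tt$city), tt$new_date),]
a <- mdy("01/01/11") # first month of the extended period
b <- a + months(0:92) # 93 consecutive month starts (Jan 2011 to Sep 2018); adjust to your range
# every city/month combination over the extended period
df5 <- expand.grid(city = c("Chicago", "Seattle", "NYC"), new_date = as.factor(b))
# combinations not already present in df4 (dplyr's setdiff compares data frame rows)
df6 <- setdiff(df5, df4[-3])
df6$Freq <- 0 # assign zero count
df7 <- rbind(df4, df6)
# sort by city, then by month
df8 <- df7[order(df7$city, df7$new_date), ]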
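A possibly shorter route to the same kind of extended result is to fix the set of months as factor levels before calling table(), so the empty months come out as zeros directly. This is only a sketch in base R: it assumes the same 93-month window starting in January 2011 as above, and the names month_start, all_months, month_f and counts are made up for illustration.

```r
# Sample data from the question
df <- data.frame(
  city = c(rep("Seattle", 5), rep("NYC", 3), rep("Chicago", 5)),
  date_of_event = as.Date(
    c("01/13/2011", "01/17/2011", "03/15/2011", "05/21/2011", "05/23/2011",
      "01/20/2011", "01/22/2011", "03/23/2011", "01/18/2011", "02/24/2011",
      "02/26/2011", "04/30/2011", "06/18/2011"),
    "%m/%d/%Y"),
  stringsAsFactors = FALSE
)

# first day of each event's month, without lubridate
month_start <- as.Date(format(df$date_of_event, "%Y-%m-01"))

# every month start in the window (assumed: 93 months from Jan 2011)
all_months <- seq(as.Date("2011-01-01"), by = "month", length.out = 93)

# the factor levels decide which months table() reports, including empty ones
month_f <- factor(format(month_start), levels = format(all_months))

counts <- as.data.frame(table(city = df$city, new_date = month_f),
                        stringsAsFactors = FALSE)
counts$new_date <- as.Date(counts$new_date) # back from text to Date
counts <- counts[order(counts$city, counts$new_date), ]
```

With real data you would probably derive the window from range(df$date_of_event) rather than hard-coding it.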