I have a dataset such as the one below except for around 5 million observations. I have already filtered the dates based on the time they were recorded in previous code to include only the calls made during working time. Now, I want to separate the dates based on WORKERCALL_ID in order to see a list of all of the unique dates for each worker and the number of times each WORKERCALL_ID shows up on each date (number of calls per date, separated by each WORKERCALL_ID. I tried to do this using a contingency matrix and then changing it to a data frame, but the file is so large that my R session always aborts. Does anyone have any idea how to accomplish this?
WORKERCALL_ID DATE
124789244 02-01-2014
128324834 05-01-2014
124184728 06-10-2014
An example of desired output is below, for each WORKERCALL_ID and date. My end goal is to be able to subset the result and remove the rows/ID's with a high frequency of calls.
WORKERCALL_ID DATE FREQ
124789244 02-01-2014 4
124789244 02-23-2014 1
Two options :
table(df$WORKERCALL_ID, df$DATE)
Or, using dplyr (also including the requested added filtering out for IDs that have any cases of frequency higher than 5):
df %>% group_by(WORKERCALL_ID, DATE) %>% summarize(freq=n()) %>% group_by(WORKERCALL_ID) %>%
filter(!any(freq>5))
Example:
rbind(as.data.frame(df),data.frame(WORKERCALL_ID=128324834, DATE="Moose",freq=6,stringsAsFactors = FALSE)) %>% group_by(WORKERCALL_ID) %>% filter(!any(freq>5))
# A tibble: 2 x 3
# Groups: WORKERCALL_ID [2]
WORKERCALL_ID DATE freq
<dbl> <chr> <dbl>
1 124184728. 06-10-2014 1.
2 124789244. 02-01-2014 1.
Note how ID 128324834 is removed from the final result.