Search code examples
runiquecontingency

R Display by Unique Value and Frequency


I have a dataset such as the one below except for around 5 million observations. I have already filtered the dates based on the time they were recorded in previous code to include only the calls made during working time. Now, I want to separate the dates based on WORKERCALL_ID in order to see a list of all of the unique dates for each worker and the number of times each WORKERCALL_ID shows up on each date (number of calls per date, separated by each WORKERCALL_ID. I tried to do this using a contingency matrix and then changing it to a data frame, but the file is so large that my R session always aborts. Does anyone have any idea how to accomplish this?

WORKERCALL_ID     DATE
124789244   02-01-2014
128324834   05-01-2014
124184728   06-10-2014

An example of desired output is below, for each WORKERCALL_ID and date. My end goal is to be able to subset the result and remove the rows/ID's with a high frequency of calls.

WORKERCALL_ID   DATE    FREQ
124789244   02-01-2014    4
124789244   02-23-2014    1

Solution

  • Two options :

    table(df$WORKERCALL_ID, df$DATE)
    

    Or, using dplyr (also including the requested added filtering out for IDs that have any cases of frequency higher than 5):

    df %>% group_by(WORKERCALL_ID, DATE) %>% summarize(freq=n()) %>% group_by(WORKERCALL_ID) %>% 
      filter(!any(freq>5))
    

    Example:

    rbind(as.data.frame(df),data.frame(WORKERCALL_ID=128324834, DATE="Moose",freq=6,stringsAsFactors = FALSE)) %>% group_by(WORKERCALL_ID) %>% filter(!any(freq>5))
    # A tibble: 2 x 3
    # Groups:   WORKERCALL_ID [2]
      WORKERCALL_ID DATE        freq
              <dbl> <chr>      <dbl>
    1    124184728. 06-10-2014    1.
    2    124789244. 02-01-2014    1.
    

    Note how ID 128324834 is removed from the final result.