
Comparing between groups in grouped dataframe


I am trying to compare items in consecutive groups of a dataframe. I guess this is pretty easy when you know what you are doing...

My data set can be represented as follows:

set.seed(1)
data <- data.frame(
 date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15)),
 id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)

This yields a dataframe that looks like the following (dates are displayed here as d/mm/yyyy):

date    id
1/02/2015   1008
1/02/2015   1009
1/02/2015   1011
1/02/2015   1015
1/02/2015   1008
1/02/2015   1014
1/02/2015   1015
1/02/2015   1012
1/02/2015   1012
1/02/2015   1006
1/02/2015   1008
1/02/2015   1007
1/02/2015   1012
1/02/2015   1009
1/02/2015   1013
2/02/2015   1010
2/02/2015   1013
2/02/2015   1015
2/02/2015   1009
2/02/2015   1013
2/02/2015   1015
2/02/2015   1008
2/02/2015   1012
2/02/2015   1007
2/02/2015   1008
2/02/2015   1009
2/02/2015   1006
2/02/2015   1009
2/02/2015   1014
2/02/2015   1009
2/02/2015   1010
3/02/2015   1011
3/02/2015   1010
3/02/2015   1007
3/02/2015   1014
3/02/2015   1012
3/02/2015   1013
3/02/2015   1007
3/02/2015   1013
3/02/2015   1010

I then want to group the data by date (group_by) and remove duplicates (distinct) before comparing between the groups. The goal is to determine, from day to day, which new ids are added and which ids leave. Day 1 and day 2 would be compared to find the ids in day 2 that were not in day 1 and the ids in day 1 that are not in day 2; the same comparison would then be made between day 2 and day 3, and so on.
The comparison itself is easy with anti_join (dplyr), but I don't know how to reference individual groups in the dataset.
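
For a single pair of days, the comparison described above can be sketched with `filter()` and `anti_join()`. This is only an illustration on a small hand-made data frame; `toy`, `day1`, `day2`, `left` and `added` are hypothetical names, not part of the original data, and the sketch does not address stepping through consecutive groups.

```r
library(dplyr)

# Hypothetical toy data, small enough to check by eye
toy <- data.frame(
  date = c("2015-02-01", "2015-02-01", "2015-02-01", "2015-02-02", "2015-02-02"),
  id   = c("1008", "1009", "1011", "1009", "1010"),
  stringsAsFactors = FALSE
)

# Unique ids seen on each of the two days
day1 <- toy %>% filter(date == "2015-02-01") %>% distinct(id)
day2 <- toy %>% filter(date == "2015-02-02") %>% distinct(id)

left  <- anti_join(day1, day2, by = "id")  # ids in day 1 but not in day 2
added <- anti_join(day2, day1, by = "id")  # ids in day 2 but not in day 1
```

Here `left$id` contains 1008 and 1011, and `added$id` contains 1010.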

My attempt (or one of my attempts) looks like:

data %>%
  group_by(date) %>%
  distinct(id) %>%
  do(lost = anti_join(., lag(.), by="id"))

But of course this does not work; I just get:

Error in anti_join_impl(x, y, by$x, by$y) : Can't join on 'id' x 'id' because of incompatible types (factor / logical)

Is what I am attempting to do even possible or should I be looking at writing a clunky function to do it?


Solution

  • I'm sure I don't get to vote for my own answer, but I must say that I like mine the best. I was hoping for an answer that used the dplyr tools, so I kept researching, and I think I now have a (semi-)elegant solution (apart from the for loop in my function).

    Generating the sample data set the same way but with more data to make it more interesting:

    set.seed(1)
    data <- data.frame(
      date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15), rep('2015-02-04',15), rep('2015-02-05',15)),
      id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
    )
    

    Searching the web, I found the tidyr function nest() (it works alongside dplyr), which looked to solve all my grouping issues. nest() takes the groups created by group_by() and rolls them into a list of data frames: you end up with one row for each value of the grouping variable and, in a list column, a data frame holding the remaining variables for that group. Here it is:

    library(dplyr)  # group_by(), distinct(), anti_join()
    library(tidyr)  # nest()

    dataNested <- data %>%
      group_by(date) %>%
      distinct(id) %>%
      nest()
    

    Which yields a fairly strange dataframe that looks like:

         date          data
    1    2015-02-01    list(id = c(3, 4, 6, 10, 9, 7, 1, 2, 8))
    2    2015-02-02    list(id = c(5, 8, 10, 4, 3, 7, 2, 1, 9))
    3    2015-02-03    list(id = c(6, 5, 2, 9, 7, 8))
    4    2015-02-04    list(id = c(1, 5, 8, 7, 9, 3, 4, 6, 10))
    5    2015-02-05    list(id = c(3, 5, 4, 7, 8, 1, 9))
    

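    The numbers in the lists are not the ids themselves: id is stored as a factor (data.frame() converted strings to factors by default before R 4.0), and this printout shows the integer codes that index into the factor's levels (strange but true).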
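
A minimal base R sketch of how a factor maps integer codes to labels, which may explain the small integers shown in the nested output; `ids` here is a hypothetical vector, not the data above:

```r
# A character vector stored as a factor
ids <- factor(c("1010", "1013", "1015", "1009"))

levels(ids)        # the sorted unique labels
as.integer(ids)    # integer codes: each value's position within levels(ids)
as.character(ids)  # the original labels again

# Looking the codes up in the levels recovers the labels
recovered <- levels(ids)[as.integer(ids)]
```

Calling `as.character()` on the column before comparing is one way to work with the ids directly rather than with their codes.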

    This now allows us to reference the groups by index number viz:

    dataNested$data[[2]]
    

    returns:

    # A tibble: 9 × 1
          id
      <fctr>
    1   1010
    2   1013
    3   1015
    4   1009
    5   1008
    6   1012
    7   1007
    8   1006
    9   1014
    

    From here it's a simple matter of writing a function that does an anti_join between each pair of consecutive groups, leaving just the ids that were present in one period but missing from the next (this is the part I'm not proud of, and it really shows my lack of R skills; please feel free to suggest improvements):

    ## Function departed() - returns the ids that were dropped from each subsequent time period
    departed <- function(groups) {
      # One slot per group; the first stays NULL because it has no predecessor
      tempList <- vector("list", nrow(groups))
      # Loop through the groups and anti_join each one against the next
      # (seq_len() gives an empty loop when there is only one group)
      for (i in seq_len(nrow(groups) - 1)) {
        tempList[[i + 1]] <-
          anti_join(data.frame(groups$data[[i]]), data.frame(groups$data[[i + 1]]), by = "id")
      }
      return(tempList)
    }
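
The question also asks which ids arrive each day, and the explicit loop can be avoided. One possible loop-free sketch in base R, assuming the ids are compared as character vectors rather than data frames; it swaps `anti_join()` for `setdiff()`, and `toy`, `ids_by_day`, `prev_day`, `next_day` and `arrived` are hypothetical names:

```r
# Hypothetical toy data: three days, with repeated ids within a day
toy <- data.frame(
  date = c(rep("2015-02-01", 4), rep("2015-02-02", 3), rep("2015-02-03", 2)),
  id   = c("1008", "1009", "1011", "1008", "1009", "1010", "1010", "1010", "1011"),
  stringsAsFactors = FALSE
)

# One vector of unique ids per date; split() orders groups by sorted date label
ids_by_day <- lapply(split(toy$id, toy$date), unique)

prev_day <- ids_by_day[-length(ids_by_day)]  # every day except the last
next_day <- ids_by_day[-1]                   # every day except the first

# Pairwise set differences between consecutive days
departed <- Map(setdiff, prev_day, next_day)
arrived  <- Map(setdiff, next_day, prev_day)

# Label departures by the later day of each pair (arrivals already are)
names(departed) <- names(next_day)
```

With this toy data, `departed[["2015-02-02"]]` holds 1008 and 1011, and `arrived[["2015-02-02"]]` holds 1010. Note that assigning `departed` here would overwrite the function of the same name if both were defined in one session.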
    

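    Applying this function to our nested data yields a list of data frames of departed ids, one per date (the first entry is NULL because the first day has nothing to be compared against):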

    > departedIDs <- dataNested %>% departed()
    
    > departedIDs
    [[1]]
    NULL
    
    [[2]]
        id
    1 1011
    
    [[3]]
        id
    1 1006
    2 1008
    3 1009
    4 1015
    
    [[4]]
        id
    1 1007
    
    [[5]]
        id
    1 1011
    2 1015
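
If a single table is easier to work with than a list, the result can be stacked with `bind_rows()`. A sketch under the assumption that the list has been named by date; `departed_list` is a hypothetical stand-in for the output above, with character ids:

```r
library(dplyr)

# Hypothetical stand-in for the list returned by departed()
departed_list <- list(
  NULL,
  data.frame(id = "1011", stringsAsFactors = FALSE),
  data.frame(id = c("1006", "1008"), stringsAsFactors = FALSE)
)
names(departed_list) <- c("2015-02-01", "2015-02-02", "2015-02-03")

# Drop the empty first entry, then stack with the list names as a date column
non_empty    <- Filter(Negate(is.null), departed_list)
departed_tbl <- bind_rows(non_empty, .id = "date")
```

The resulting `departed_tbl` has one row per departed id, with the `date` column taken from the list names.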
    

    I hope this answer will help others whose brains work the same way as mine.