Comparing between groups in grouped dataframe

I am trying to perform a comparison between items in subsequent groups in a dataframe - I guess this is pretty easy when you know what you are doing...

My data set can be represented as follows:

set.seed(1)
data <- data.frame(
 date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15)),
 id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)

Which yields a dataframe that looks like:

date    id
1/02/2015   1008
1/02/2015   1009
1/02/2015   1011
1/02/2015   1015
1/02/2015   1008
1/02/2015   1014
1/02/2015   1015
1/02/2015   1012
1/02/2015   1012
1/02/2015   1006
1/02/2015   1008
1/02/2015   1007
1/02/2015   1012
1/02/2015   1009
1/02/2015   1013
2/02/2015   1010
2/02/2015   1013
2/02/2015   1015
2/02/2015   1009
2/02/2015   1013
2/02/2015   1015
2/02/2015   1008
2/02/2015   1012
2/02/2015   1007
2/02/2015   1008
2/02/2015   1009
2/02/2015   1006
2/02/2015   1009
2/02/2015   1014
2/02/2015   1009
2/02/2015   1010
3/02/2015   1011
3/02/2015   1010
3/02/2015   1007
3/02/2015   1014
3/02/2015   1012
3/02/2015   1013
3/02/2015   1007
3/02/2015   1013
3/02/2015   1010

Then I want to group the data by date (group_by) and then filter out duplicates (distinct) before comparing between the groups. What I want to do is determine from day to day which new id's are added and which id's leave. So day 1 and day 2 would be compared to determine the id's in day 2 that were not in day 1 and the id's that were in day 1 but not present in day 2, then do the same comparisons between day 2 and day 3 etc.
The comparison can be done very easily using an anti_join (dplyr) but I don't know how to reference individual groups in the dataset.

My attempt (or one of my attempts) looks like:

data %>%
  group_by(date) %>%
  distinct(id) %>%
  do(lost = anti_join(., lag(.), by="id"))

But of course this does not work, I just get:

Error in anti_join_impl(x, y, by$x, by$y) : Can't join on 'id' x 'id' because of incompatible types (factor / logical)

Is what I am attempting to do even possible or should I be looking at writing a clunky function to do it?

Solution

I'm sure I don't get to vote for my own answer but I must say that I like mine the best. I was hoping to get an answer that used the dplyr tools to solve the problem so I kept researching and I think I now have a (semi) elegant solution (apart from the for loop in my function).

Generating the sample data set the same way but with more data to make it more interesting:

set.seed(1)
data <- data.frame(
  date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15), rep('2015-02-04',15), rep('2015-02-05',15)),
  id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)

Searching through the interweb I found the dplyr function 'nest()' which looked to solve all my grouping issues. The nest() function takes the groups created by group_by() and rolls them into a list of data frames so you end up with one entry for each variable you have grouped on and then a data frame for all of the remaining variables that fit into that group - here it is:

dataNested <- data %>%
  group_by(date) %>%
  distinct(id) %>%
  nest()

Which yields a fairly strange dataframe that looks like:

     date          data
1    2015-02-01    list(id = c(3, 4, 6, 10, 9, 7, 1, 2, 8))
2    2015-02-02    list(id = c(5, 8, 10, 4, 3, 7, 2, 1, 9))
3    2015-02-03    list(id = c(6, 5, 2, 9, 7, 8))
4    2015-02-04    list(id = c(1, 5, 8, 7, 9, 3, 4, 6, 10))
5    2015-02-05    list(id = c(3, 5, 4, 7, 8, 1, 9))

Whereby the indexes in the lists reference a list of the id's (strange but true).

This now allows us to reference the groups by index number viz:

dataNested$data[[2]]

returns:

# A tibble: 9 × 1
      id
  <fctr>
1   1010
2   1013
3   1015
4   1009
5   1008
6   1012
7   1007
8   1006

From here it's a simple matter of writing a function that will do the anti_join to leave us with just the differences between each subsequent group (though this is the part I'm not proud of and really starts to show my lack of R skills - please feel free to suggest improvements):

## Function departed() - returns the id's that were dropped from each subsequent time period
departed <- function(groups) {
  tempList <- vector("list", nrow(groups))
  # Loop through the groups and do an anti_join between each
  for (i in seq(1, nrow(groups) - 1)) {
  tempList[[i + 1]] <-
  anti_join(data.frame(groups$data[[i]]),  data.frame(groups$data[[i + 1]]), by = "id")

  }
  return(tempList)
}

Applying this function to our nested data yields the list of lists of departed id's:

> departedIDs <- dataNested %>% departed()

> departedIDs
[[1]]
NULL

[[2]]
    id
1 1011

[[3]]
    id
1 1006
2 1008
3 1009
4 1015

[[4]]
    id
1 1007

[[5]]
    id
1 1011
2 1015

I hope this answer will help others who's brain works the same way as mine.