I am trying to perform a comparison between items in subsequent groups in a dataframe - I guess this is pretty easy when you know what you are doing...
My data set can be represented as follows:
set.seed(1)
data <- data.frame(
date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15)),
id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)
Which yields a dataframe that looks like:
date id
1/02/2015 1008
1/02/2015 1009
1/02/2015 1011
1/02/2015 1015
1/02/2015 1008
1/02/2015 1014
1/02/2015 1015
1/02/2015 1012
1/02/2015 1012
1/02/2015 1006
1/02/2015 1008
1/02/2015 1007
1/02/2015 1012
1/02/2015 1009
1/02/2015 1013
2/02/2015 1010
2/02/2015 1013
2/02/2015 1015
2/02/2015 1009
2/02/2015 1013
2/02/2015 1015
2/02/2015 1008
2/02/2015 1012
2/02/2015 1007
2/02/2015 1008
2/02/2015 1009
2/02/2015 1006
2/02/2015 1009
2/02/2015 1014
2/02/2015 1009
2/02/2015 1010
3/02/2015 1011
3/02/2015 1010
3/02/2015 1007
3/02/2015 1014
3/02/2015 1012
3/02/2015 1013
3/02/2015 1007
3/02/2015 1013
3/02/2015 1010
Then I want to group the data by date (group_by) and then filter out duplicates (distinct) before comparing between the groups. What I want to do is determine from day to day which new id's are added and which id's leave. So day 1 and day 2 would be compared to determine the id's in day 2 that were not in day 1 and the id's that were in day 1 but not present in day 2, then do the same comparisons between day 2 and day 3 etc.
The comparison can be done very easily using an anti_join (dplyr) but I don't know how to reference individual groups in the dataset.
My attempt (or one of my attempts) looks like:
data %>%
group_by(date) %>%
distinct(id) %>%
do(lost = anti_join(., lag(.), by="id"))
But of course this does not work, I just get:
Error in anti_join_impl(x, y, by$x, by$y) : Can't join on 'id' x 'id' because of incompatible types (factor / logical)
Is what I am attempting to do even possible or should I be looking at writing a clunky function to do it?
I'm sure I don't get to vote for my own answer but I must say that I like mine the best. I was hoping to get an answer that used the dplyr tools to solve the problem so I kept researching and I think I now have a (semi) elegant solution (apart from the for loop in my function).
Generating the sample data set the same way but with more data to make it more interesting:
set.seed(1)
data <- data.frame(
date = c(rep('2015-02-01',15), rep('2015-02-02',16), rep('2015-02-03',15), rep('2015-02-04',15), rep('2015-02-05',15)),
id = as.character(c(1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,16,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE), 1005 + sample.int(10,15,replace=TRUE)))
)
Searching through the interweb I found the dplyr function 'nest()' which looked to solve all my grouping issues. The nest() function takes the groups created by group_by() and rolls them into a list of data frames so you end up with one entry for each variable you have grouped on and then a data frame for all of the remaining variables that fit into that group - here it is:
dataNested <- data %>%
group_by(date) %>%
distinct(id) %>%
nest()
Which yields a fairly strange dataframe that looks like:
date data
1 2015-02-01 list(id = c(3, 4, 6, 10, 9, 7, 1, 2, 8))
2 2015-02-02 list(id = c(5, 8, 10, 4, 3, 7, 2, 1, 9))
3 2015-02-03 list(id = c(6, 5, 2, 9, 7, 8))
4 2015-02-04 list(id = c(1, 5, 8, 7, 9, 3, 4, 6, 10))
5 2015-02-05 list(id = c(3, 5, 4, 7, 8, 1, 9))
Whereby the indexes in the lists reference a list of the id's (strange but true).
This now allows us to reference the groups by index number viz:
dataNested$data[[2]]
returns:
# A tibble: 9 × 1
id
<fctr>
1 1010
2 1013
3 1015
4 1009
5 1008
6 1012
7 1007
8 1006
From here it's a simple matter of writing a function that will do the anti_join to leave us with just the differences between each subsequent group (though this is the part I'm not proud of and really starts to show my lack of R skills - please feel free to suggest improvements):
## Function departed() - returns the id's that were dropped from each subsequent time period
departed <- function(groups) {
tempList <- vector("list", nrow(groups))
# Loop through the groups and do an anti_join between each
for (i in seq(1, nrow(groups) - 1)) {
tempList[[i + 1]] <-
anti_join(data.frame(groups$data[[i]]), data.frame(groups$data[[i + 1]]), by = "id")
}
return(tempList)
}
Applying this function to our nested data yields the list of lists of departed id's:
> departedIDs <- dataNested %>% departed()
> departedIDs
[[1]]
NULL
[[2]]
id
1 1011
[[3]]
id
1 1006
2 1008
3 1009
4 1015
[[4]]
id
1 1007
[[5]]
id
1 1011
2 1015
I hope this answer will help others who's brain works the same way as mine.