I have a large data set grouped by two identifiers (Group and ID), with an Initial column that shows the elements present in an initial time period and a Post column that shows the elements that occur after the initial time period. A working example is below:
SampleDF<-data.frame(Group=c(0,0,1),ID=c(2,2,3),
Initial=c('F28D,G06F','F24J','G01N'),
Post=c('G06F','H02G','F23C,H02G,G01N'))
I want to compare the elements in Initial and Post for each Group/ID combination to find out when elements match, when only new elements exist, and when both pre-existing and new elements exist. Ideally, I would like to end up with a new Type variable with the following output:
SampleDF<-cbind(SampleDF, 'Type'=rbind(0,1,2))
where (relative to Initial) 0 indicates that there are no new element(s) in Post, 1 indicates that there are only new element(s) in Post, and 2 indicates that there are both pre-existing and new element(s) in Post.
Your situation is complex because both the pattern and the vector vary from row to row when doing string matching with agrepl. Here is a solution that is somewhat tricky but does the job on the sample data.
# Split a comma-separated cell into individual elements, trimming stray spaces
split_elements <- function(x) trimws(strsplit(as.character(x), ",")[[1]])

element_counter <- list()
for (i in seq_along(SampleDF$Initial)) {
  initial <- split_elements(SampleDF$Initial[i])
  post <- split_elements(SampleDF$Post[i])
  # TRUE for each Post element that agrepl (approximately) matches in Initial
  matched <- vapply(post, function(p) any(agrepl(p, initial)), logical(1))
  # 0: nothing new, 1: only new elements, 2: both pre-existing and new
  element_counter[[i]] <- if (all(matched)) 0 else if (!any(matched)) 1 else 2
}
SampleDF$Type <- unlist(element_counter)
## SampleDF
#   Group ID   Initial           Post Type
# 1     0  2 F28D,G06F           G06F    0
# 2     0  2      F24J           H02G    1
# 3     1  3      G01N F23C,H02G,G01N    2
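
The loop above relies on agrepl, which matches approximately, so codes that differ by a single character may be treated as the same element. If exact matching is what you need, here is a minimal sketch that compares the split elements with %in% instead. The helper names split_elements and classify_row are made up for illustration, and the sketch assumes the elements are separated by commas with optional surrounding spaces.

```r
# Rebuild the sample data (stringsAsFactors = FALSE keeps the columns as character)
SampleDF <- data.frame(Group = c(0, 0, 1), ID = c(2, 2, 3),
                       Initial = c("F28D,G06F", "F24J", "G01N"),
                       Post = c("G06F", "H02G", "F23C,H02G,G01N"),
                       stringsAsFactors = FALSE)

# Hypothetical helper: split a comma-separated cell into trimmed elements
split_elements <- function(x) {
  trimws(strsplit(as.character(x), ",", fixed = TRUE)[[1]])
}

# Hypothetical helper: classify one row by exact element comparison
# 0: no new elements in Post, 1: only new elements, 2: both old and new
classify_row <- function(initial, post) {
  is_old <- split_elements(post) %in% split_elements(initial)
  if (all(is_old)) 0 else if (!any(is_old)) 1 else 2
}

# Apply the classifier row by row
SampleDF$Type <- mapply(classify_row, SampleDF$Initial, SampleDF$Post,
                        USE.NAMES = FALSE)
```

Because each row is classified from its own Initial and Post values, this works without an explicit loop and without special-casing single-element cells. It does not handle empty or NA cells; those would need a separate rule.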