
Partial matching of elements in two string columns in R


I have a large data set grouped by two identifiers (Group and ID), with an Initial column that shows the elements occurring in an initial time period and a Post column that shows the elements occurring after the initial time period. A working example is below:

SampleDF<-data.frame(Group=c(0,0,1),ID=c(2,2,3),
Initial=c('F28D,G06F','F24J','G01N'),
Post=c('G06F','H02G','F23C,H02G,G01N'))

I want to compare elements in Initial and Post for each Group/ID combination to find out when elements match, when only new elements exist, and when both pre-existing and new elements exist. Ideally, I would like to end up with a new Type variable with the following output:

SampleDF<-cbind(SampleDF, 'Type'=rbind(0,1,2))

where (relative to Initial) 0 indicates that there are no new element(s) in Post, 1 indicates that there are only new element(s) in Post, and 2 indicates that there are both pre-existing and new element(s) in Post.


Solution

  • Your case is tricky because both the pattern and the vector being searched change from row to row when matching strings with agrepl. The loop below is a bit involved, but it reproduces the desired Type values for the sample data:

    element_counter <- integer(nrow(SampleDF))
    for (i in seq_len(nrow(SampleDF))) {
      # Split both columns into individual, whitespace-trimmed elements
      initial <- trimws(strsplit(as.character(SampleDF$Initial[i]), ",")[[1]])
      post <- trimws(strsplit(as.character(SampleDF$Post[i]), ",")[[1]])
      # For each Post element: does it (approximately) match any Initial element?
      matched <- vapply(post, function(p) any(agrepl(p, initial)), logical(1))
      # 0 = no new elements, 1 = only new elements, 2 = both old and new
      element_counter[i] <- if (all(matched)) 0 else if (!any(matched)) 1 else 2
    }
    
    SampleDF$Type <- element_counter
    
    
    ## SampleDF
    #   Group  ID   Initial             Post  Type
    #1     0   2  F28D,G06F             G06F    0
    #2     0   2       F24J             H02G    1
    #3     1   3       G01N   F23C,H02G,G01N    2
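
Because agrepl matches approximately, two codes that differ by a single character may be treated as the same element. If exact matching is what is needed, a minimal sketch along the following lines could be used instead. The helper name classify_post is hypothetical (it is not part of the original answer), and the sketch assumes elements are separated by commas, with stray whitespace around them ignored.

```r
# Hypothetical helper: classify one row by exact matching of
# comma-separated elements (assumes "," is the only separator).
classify_post <- function(initial, post) {
  init_set <- trimws(strsplit(as.character(initial), ",", fixed = TRUE)[[1]])
  post_set <- trimws(strsplit(as.character(post), ",", fixed = TRUE)[[1]])
  is_old <- post_set %in% init_set   # TRUE where a Post element was in Initial
  if (all(is_old)) {
    0L                               # no new elements in Post
  } else if (!any(is_old)) {
    1L                               # only new elements in Post
  } else {
    2L                               # both pre-existing and new elements
  }
}

# Sample data from the question
SampleDF <- data.frame(
  Group = c(0, 0, 1), ID = c(2, 2, 3),
  Initial = c("F28D,G06F", "F24J", "G01N"),
  Post = c("G06F", "H02G", "F23C,H02G,G01N"),
  stringsAsFactors = FALSE
)

# Apply the helper row by row
SampleDF$Type <- mapply(classify_post, SampleDF$Initial, SampleDF$Post,
                        USE.NAMES = FALSE)
```

Unlike a count of unmatched elements, this returns 1 for a row whose Post holds several new elements and nothing else, which is what the 0/1/2 coding in the question asks for.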