Count how often a string appears in the rows above, along with the corresponding results


I have a data frame like this:

df <- data.frame(value = c("a","b","b","d","a","b","b","d","a","b","c","d"), 
             pattern = c("NA","a","ab","abb","bbd","bda","dab","abb","bbd","bda","dab","abc"))

The value column holds the actual behaviour, and the pattern column shows the cumulative behaviour before that action happens. For each row, I want to compare its pattern with the patterns in the 4 rows above and count the number of matches, plus how often each letter appears in the "value" column of the matching rows, in order to calculate the expected result.

The result should look like this:

   value pattern appearance a b c d exp.result
1      a      NA          0 0 0 0 0       <NA>
2      b       a          0 0 0 0 0       <NA>
3      b      ab          0 0 0 0 0       <NA>
4      d     abb          0 0 0 0 0       <NA>
5      a     bbd          0 0 0 0 0       <NA>
6      b     bda          0 0 0 0 0       <NA>
7      b     dab          0 0 0 0 0       <NA>
8      d     abb          1 0 0 0 1          d
9      a     bbd          1 1 0 0 0          a
10     b     bda          1 0 1 0 0          b
11     c     dab          1 0 1 0 0          b
12     d     abc          0 0 0 0 0       <NA>

I hope somebody can help me with this problem.


Solution

  • You can use this approach:

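    df <- data.frame(
            # value is a factor so that table() reports every level in each window
            value = factor(c("a","b","b","d","a","b","b","d","a","b","c","d")), 
            pattern = c(NA,"a","ab","abb","bbd","bda","dab","abb","bbd","bda","dab","abc"))
    
    win <- 4  # number of preceding rows to search
    analyzeWindow <- function(idx){
      # indices of up to `win` rows before idx (none for the first row)
      idxs <- if(idx == 1) integer() else max(1,idx-win):(idx-1)
      winDF <- df[idxs,]
      # keep rows with the same pattern; which() drops NA comparisons without shifting positions
      winDF <- winDF[which(winDF$pattern == df$pattern[idx]),]
      # per-level counts of the values that followed this pattern
      expValWeights <- unlist(as.list(table(winDF$value)))
    
      c(appearances=nrow(winDF),expValWeights)
    }
    
    # one row of counts per input row
    newCols <- t(sapply(seq_len(nrow(df)),analyzeWindow))
    df2 <- cbind(df,newCols)
    # most frequent following value (first one on ties), NA when nothing matched
    df2$exp.result <- colnames(newCols)[-1][max.col(newCols[,-1],ties.method='first')]
    df2$exp.result[rowSums(newCols[,-1]) == 0] <- NA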
    
    > df2
    
       value pattern appearances a b c d exp.result
    1      a    <NA>           0 0 0 0 0       <NA>
    2      b       a           0 0 0 0 0       <NA>
    3      b      ab           0 0 0 0 0       <NA>
    4      d     abb           0 0 0 0 0       <NA>
    5      a     bbd           0 0 0 0 0       <NA>
    6      b     bda           0 0 0 0 0       <NA>
    7      b     dab           0 0 0 0 0       <NA>
    8      d     abb           1 0 0 0 1          d
    9      a     bbd           1 1 0 0 0          a
    10     b     bda           1 0 1 0 0          b
    11     c     dab           1 0 1 0 0          b
    12     d     abc           0 0 0 0 0       <NA>
    

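    NOTE: This code requires the "value" column to be a factor, so that table() returns a count (possibly 0) for every level and each window yields a vector of the same length. Since R 4.0.0, data.frame() no longer converts strings to factors by default, so use as.factor if it isn't one.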
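
To make the window logic and the factor requirement more concrete, here is a minimal sketch that traces a single row (row 8, pattern "abb") by hand. It reuses the data from the question; the names `idx`, `hits` and `counts` are only illustrative and not part of the answer's code, and `which()` is used here as one possible way to ignore NA comparisons.

```r
# Same data as in the question; `value` is stored as a factor
df <- data.frame(
  value   = factor(c("a","b","b","d","a","b","b","d","a","b","c","d")),
  pattern = c(NA,"a","ab","abb","bbd","bda","dab","abb","bbd","bda","dab","abc")
)

win  <- 4                             # look back at most 4 rows
idx  <- 8                             # row being analysed (pattern "abb")
idxs <- max(1, idx - win):(idx - 1)   # rows 4 to 7

# Rows in the window with the same pattern; which() keeps only TRUE
# positions, so NA comparisons are ignored
hits <- idxs[which(df$pattern[idxs] == df$pattern[idx])]
hits                                  # row 4, the earlier "abb"

# Factor: one count per level, zeros included
counts <- table(df$value[hits])
counts

# Character: levels that do not occur are simply missing
table(as.character(df$value[hits]))

# With no match at all, the factor version still has one entry per level,
# while the character version is empty, so the per-row results would have
# different lengths and could not be stacked into a matrix
length(table(df$value[integer(0)]))
length(table(as.character(df$value[integer(0)])))
```

The only match in the window is row 4, whose value is "d", which is why row 8 of the result shows one appearance and "d" as the expected result.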