Tags: r, for-loop, dataframe, error-handling, rollapply

Find similar elements of a vector and modify everything in between


I have a large data frame (df) with binary values, coded as 1 or 2. NAs are also included in the data. As a practical example, I will create a short vector containing a subset of one user's data:

df <- c(NA,NA,2,1,1,1,2,1,2,2,1,1,1,NA,2,2,1,2,1,1,1,2)

What I would like is a function that finds the first and the second 2 in a vector and sets everything within this interval to 2. However, if the difference between the positions of the second and the first 2 is greater than 3, the values stay as they are and no change is made.

In addition, the function has to repeat this along the whole of df, moving on from each 2 to the next one. For example, considering again the case of:

df <- c(NA,NA,2,1,1,1,2,1,2,2,1,1,1,NA,2,2,1,2,1,1,1,2)

The function should have this outcome:

df_outcome <- c(NA,NA,2,1,1,1,2,2,2,2,1,1,1,NA,2,2,2,2,1,1,1,2)

Notice that in df_outcome the values between the very first and second 2s were not changed, because the difference in their positions is greater than 3. On the other hand, the 1s at positions 8 and 17, which each sit between two 2s that are close enough together, were changed to 2.

What I have tried to do (but it does not work properly):

With the help of rollapply in the zoo package, I have tried to create a function that finds the first and the second 2 of the array and performs the modifications as described above.

func <- function (q) {
  for (i in (which(q %in% 2)[1]):(which(q %in% 2)[2])) {
    q[i]<-2
  }
  return(q)
}

Then I passed it to rollapply so that I can specify a width for each window, plus other arguments such as the alignment of the result index (left).

df_outcome<-rollapply(df, width = 3, FUN = func, fill = NA, partial = TRUE, align = "left")

The problem is that the user-defined function works if applied to a vector. When passed as an argument to rollapply, however, it returns an error:

Error in (which(q %in% 2)[1]):(which(q %in% 2)[2]) : NA/NaN argument
Called from: FUN(data[replace(posns, !ix, 0)], ...)

I guess there is some mistake on my part in the use of rollapply, or perhaps in the format of the data, but I cannot work out what the issue is. I thought about using rollapply because my data is very long and is generated for different users. Hence, I would need a function that can also split the data by other variables such as User_ID (much like the .variables argument in ddply, or by in data.table).

Looking forward to your help.


Solution
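
Before the fix, it may help to see where the error in the question probably comes from. With width = 3 and align = "left", the first window that rollapply hands to func is c(NA, NA, 2), which contains only one 2. The sketch below reproduces that single window by hand; the names q, idx and res are only illustrative.

```r
# First width-3 window of the example vector: it holds a single 2
q <- c(NA, NA, 2)

idx <- which(q %in% 2)  # positions of the 2s inside this window
idx[1]                  # 3
idx[2]                  # NA, because there is no second 2

# func builds the sequence idx[1]:idx[2], i.e. 3:NA, which R refuses
res <- tryCatch(idx[1]:idx[2], error = function(e) conditionMessage(e))
res                     # "NA/NaN argument" (wording as in an English locale)
```

So any window with fewer than two 2s triggers the error, and a width of 3 is also too narrow to contain two 2s that are 3 positions apart. This is one reason a run-length approach is a more natural fit than a rolling window.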

  • A solution with rle:

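    rldf <- rle(df)                             # runs of equal values (each NA is its own run)
    rllead <- c(tail(rldf$values,-1), NA)       # value of the following run
    rllag <- c(NA, head(rldf$values,-1))        # value of the preceding run
    
    # runs of 1s shorter than 3 that are flanked by 2s on both sides become 2s
    rldf$values[which(rldf$values == 1 & rllag == 2 & rllead == 2 & rldf$lengths < 3)] <- 2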
    
    df_out <- inverse.rle(rldf)
    

    which gives:

    > df_out
     [1] NA NA  2  1  1  1  2  2  2  2  1  1  1 NA  2  2  2  2  1  1  1  2
    
    > identical(df_outcome,df_out)
    [1] TRUE
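
The question also asks for something that can be applied per user. One possible way, sketched here under assumptions, is to wrap the rle logic above in a function and call it by group with base R's ave. The function name fill_short_gaps, its max_gap argument, the toy data frame dat and the columns value and value_filled are all hypothetical; only User_ID comes from the question.

```r
# Hypothetical helper wrapping the rle approach shown above.
# max_gap = 2 mirrors the `lengths < 3` condition of the original answer.
fill_short_gaps <- function(x, max_gap = 2) {
  r <- rle(x)                              # each NA forms its own run
  prev_val <- c(NA, head(r$values, -1))    # value of the preceding run
  next_val <- c(tail(r$values, -1), NA)    # value of the following run
  fill <- which(r$values == 1 & prev_val == 2 & next_val == 2 &
                  r$lengths <= max_gap)
  r$values[fill] <- 2
  inverse.rle(r)
}

# Toy data (made up for illustration): two users, six observations each
dat <- data.frame(
  User_ID = rep(c("a", "b"), each = 6),
  value   = c(2, 1, 2, 1, 1, 2,
              1, 2, 1, 1, 1, 2)
)

# ave() splits `value` by User_ID, applies the helper to each piece,
# and puts the results back in the original row order
dat$value_filled <- ave(dat$value, dat$User_ID, FUN = fill_short_gaps)
dat$value_filled
# [1] 2 2 2 2 2 2 1 2 1 1 1 2
```

Grouping matters here: without it, the leading 1 of user "b" would sit between the last 2 of user "a" and the first 2 of user "b" and would be filled by mistake. The same helper should also work inside a data.table by clause or with ddply, although that is not shown here.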