
R: How to copy a column such that if the original was "TRUE" in row x, the copy will be "TRUE" in rows x-250 through to x+250?


I hope this question is posed clearly! I have looked at many guides on loops and if_else clauses etc. but have not managed to figure this out.

I am trying to find passages in a large set of txt files where a number (say, 5) of keywords occur. Example keywords are "motion" and "cause". My data is tidy (the txt files have been split so that there is one word per row) and using regular expressions I have added columns (one for each keyword) that say "TRUE" if the row contains the keyword, and are false otherwise. Now in order to find passages of interest, I want to make a copy of each column that says "TRUE" in the same rows, but also in the 250 rows above and below those rows. So for example I want to copy the column that says "TRUE" when the row contains the word "motion", such that in the new column the 500 words surrounding the word "motion" are also "TRUE" (i.e. the 250 rows above and below the one where the word is). The idea is that I can then easily check whether there are any rows where all of the copied columns are true, indicating that there is a 500-word passage where all my keywords occur.

I have tried learning about and using loops in various ways to make these copied columns, but I have not had any success so far. This is how my latest attempt looks, but it seems to have just designated the same rows as "TRUE" 250 times, rather than making the next 250 rows "TRUE". (It also gave the error message "Problem with 'mutate()' input 'copied_column'. subscript out of bounds i input 'copied_column' is 'case_when(...)'.")

n <-1
corpus <- corpus %>%
    mutate(copied_column = case_when(
      str_detect(original_column, "TRUE") ~ (repeat{
        n <- n+1
        str_detect(orginal_column, "FALSE")
        if (n == 250) {
          break
        }
       })
    ))

If anyone has any suggestion they would be most welcome. If you know any functions that I probably should be using or if you know how to properly use the ones in the above example, that would really help me out a lot.


Solution

  • Maybe the function below can solve the problem. Tested with fake data.

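    segmentTRUE <- function(X, y, dist){
      # rows from y - d to y + d, clipped to the valid range 1..n
      f <- function(y, n, d){
        from <- max(1, y - d)
        to <- min(n, y + d)
        from:to
      }
      y <- deparse(substitute(y))   # column name, passed unquoted
      w <- which(X[[y]])            # rows that are currently TRUE
      # SIMPLIFY = FALSE keeps a list even when all windows have the same length
      i <- Reduce(union, mapply(f, w, MoreArgs = list(n = nrow(X), d = dist), SIMPLIFY = FALSE))
      X[i, y] <- TRUE
      X[[y]]                        # return the widened column
    }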
    

    Test

    Make up some data and run the function in 3 different ways, two of them in a magrittr pipe.

    library(dplyr)   # provides %>% and mutate()
    
    x <- rep(FALSE, 5e1)
    x[c(2, 10, 35, 47)] <- TRUE
    df1 <- data.frame(words = rep(letters, length.out = 5e1), x)
    head(df1)
    d <- 5
    
    segmentTRUE(df1, x, d)                     # direct call
    df1 %>% segmentTRUE(x, d)                  # piped
    df1 %>% mutate(x = segmentTRUE(., x, d))   # piped, inside mutate()
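
    For the end goal in the question (rows where every keyword has a hit nearby), one possible sketch is below. It is only an illustration: the column names `motion` and `cause`, the data, and the helper `widen_true()` are all made up here, and the helper works on a bare logical vector instead of a data frame column.

    ```r
    # Hypothetical helper: same windowing idea, applied to a plain logical vector.
    widen_true <- function(x, dist) {
      n <- length(x)
      out <- rep(FALSE, n)
      for (k in which(x)) {
        # mark the window around each hit, clipped to 1..n
        out[max(1, k - dist):min(n, k + dist)] <- TRUE
      }
      out
    }

    # Made-up corpus: one word per row, one logical column per keyword.
    corpus <- data.frame(
      word   = rep(letters, length.out = 50),
      motion = seq_len(50) %in% c(10, 40),
      cause  = seq_len(50) %in% c(14, 25)
    )

    keywords <- c("motion", "cause")   # assumed column names
    dist <- 5                          # the question uses 250

    # Widen every keyword column, then keep rows where all widened columns are TRUE.
    widened <- lapply(corpus[keywords], widen_true, dist = dist)
    corpus$all_near <- Reduce(`&`, widened)

    which(corpus$all_near)   # rows 9 to 15 in this made-up data
    ```

    A row flagged this way has every keyword within `dist` rows of it, so the keywords themselves can be up to `2 * dist` rows apart.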
    

    Edit

    With nrow(df1) == 1e4, the following function is orders of magnitude faster than the Reduce version.

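    segmentTRUE2 <- function(X, y, dist){
      # rows from y - d to y + d, clipped to the valid range 1..n
      f <- function(y, n, d){
        max(1, y - d):min(n, y + d)
      }
      y <- deparse(substitute(y))   # column name, passed unquoted
      w <- which(X[[y]])            # rows that are currently TRUE
      # SIMPLIFY = FALSE keeps a list even when all windows have the same length
      i <- unique(unlist(mapply(f, w, MoreArgs = list(n = nrow(X), d = dist), SIMPLIFY = FALSE)))
      X[i, y] <- TRUE
      X[[y]]                        # return the widened column
    }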
    
    identical(segmentTRUE(df1, x, d), segmentTRUE2(df1, x, d))
    #[1] TRUE
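
    Another option, not benchmarked here, is to skip building the index ranges and instead mark where each window opens and closes, then take a running sum. The sketch below works on a bare logical vector; the name `widen_true_cumsum()` is made up for illustration, and it assumes `dist` is a non-negative whole number.

    ```r
    widen_true_cumsum <- function(x, dist) {
      n <- length(x)
      w <- which(x)
      starts <- pmax(1, w - dist)        # first row of each window
      stops  <- pmin(n, w + dist) + 1    # first row after each window
      # +1 where a window opens, -1 just after it closes;
      # slot n + 1 absorbs windows that run to the last row
      delta <- tabulate(starts, nbins = n + 1) - tabulate(stops, nbins = n + 1)
      # a row is inside at least one window when the running sum is positive
      cumsum(delta)[seq_len(n)] > 0
    }

    # Same made-up data as in the test above
    x <- rep(FALSE, 50)
    x[c(2, 10, 35, 47)] <- TRUE
    res <- widen_true_cumsum(x, 5)
    which(res)
    ```

    With the data frame from the test, this could be used as `df1$x <- widen_true_cumsum(df1$x, d)`.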