Search code examples
rdataframesapplygrepl

checking whether a word from a vector appears in the same row in different columns of a data frame


I am trying to troubleshoot my data, and check whether a certain name appears in two different columns in the same row (same observation):

df1 <- data.frame(
  text1 = c("John Jay Jakson",
            "John Jay Jakson",
            "John Jay Jakson",
            "John Jack Jakson"), 
  text2 = c("Jerry Jack Jameson", 
            "Jerry Jack Jameson", 
            "Jerry Jack Jameson", 
            "Jerry Jack Jameson"))

df2 <- data.frame(
  names = c("John", "Jay", "Jackson", "Jerry", "Jack", "Jameson"))

The code I've come up with is the following

data.check = sapply(df2$names, function(x) (grepl(x, df1$text1) & grepl(x, df1$text2))==TRUE)

or alternatively:

which(sapply(df2$names, function(x) (grepl(x, df1$text1) & grepl(x, df1$text2))==TRUE))

But these are not the best ways to sift through the data. Instead, I want to have a new column in df1, df1$check, that holds a 1/0 based on whether each row has the same name in that row under df1$text1 and df1$text2.

I know that assigning this code to a new column won't work:

df1$check = sapply(df2$names, function(x) (grepl(x, df1$text1) & grepl(x, df1$text2))==TRUE)

it gives me false for the 4th row, which should be a true.

Any help appreciated, thank you.


Solution

  • The output from sapply in the OP's code returns a logical matrix.

    > sapply(df2$names, function(x) (grepl(x, df1$text1) & grepl(x, df1$text2)))
          John   Jay Jackson Jerry  Jack Jameson
    [1,] FALSE FALSE   FALSE FALSE FALSE   FALSE
    [2,] FALSE FALSE   FALSE FALSE FALSE   FALSE
    [3,] FALSE FALSE   FALSE FALSE FALSE   FALSE
    [4,] FALSE FALSE   FALSE FALSE  TRUE   FALSE
    

    Each column of the matrix should be converged to a single logical value to create a vector. We may wrap with rowSums on the logical matrix and then convert the row wise sum to logical vector (> 0) and coerce it back to binary (+ - TRUE -> 1, FALSE -> 0)

    df1$check <- +(rowSums(sapply(df2$names, function(x) 
         (grepl(x, df1$text1) & grepl(x, df1$text2)))) > 0)
    df1$check
    [1] 0 0 0 1
    

    Or another option is to loop with lapply, return a list and use Reduce with | to return a vector

    df1$check <- +(Reduce(`|`, lapply(df2$names, function(x)
          (grepl(x, df1$text1) & grepl(x, df1$text2)))))