
Make data frame of instances where two different key words are found within a text string


I have a data frame that contains two columns: an ID number and then a string of text:

df <- data.frame(ID=c(1, 2, 3, 4, 5, 6, 7, 8), 
                 text = c("lorem ipsum dolor sit ABC, consectetur adipiscing XYZ",
                          "veritatis et quasi ABC architecto beatae vitae dicta YXZ explicabo", 
                          "dignissimos ducimus CBA blanditiis praesentium ZXY deleniti", 
                          "earum rerum hic BCA tenetur a sapiente delectus, ut aut XYZ", 
                          "enim ad minima veniam, ACB quis nostrum corporis ZYX suscipit",
                          "cillum dolore BAC eu fugiat nulla pariatur ZXY",
                          "sunt CBA, ABC in culpa qui officia deserunt mollit XYZ anim",
                          "debitis ACB aut rerum necessitatibus YZX, XZY saepe eveniet"))

I also have two different lists containing specific search terms:

listX <- c("ABC", "ACB", "BAC", "BCA", "CAB", "CBA")
listY <- c("XYZ", "XZY", "YXZ", "YZX", "ZXY", "ZYX")

I would like to search through the text for each row of the data frame, and build a new data frame that contains in one column the ID number, and then in the others the results of a match/combination of the specific search terms in listX and listY.

output <- data.frame(ID=c(1,2,3,4,5,6,7,7,8,8),
                     X=c("ABC","ABC","CBA","BCA","ACB","BAC","CBA","ABC","ACB","ACB"),
                     Y=c("XYZ","YXZ","ZXY","XYZ","ZYX","ZXY","XYZ","XYZ","YZX","XZY"))

Is there some way to programmatically generate this output data frame with every possible combination? I know I could likely do this somehow with grepl and maybe merge for different results. But this would be an ugly brute force approach and the lists are a lot longer than given in this example. Thank you in advance!


Solution

    library(dplyr)
    library(stringr)
    library(tidyr)
    
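    df |>
      # pull every listX / listY term out of each text (one list element per row)
      mutate(X = str_extract_all(text, str_flatten(listX, "|")),
             Y = str_extract_all(text, str_flatten(listY, "|")),
             # turn empty results into NA so rows without a match are kept
             across(X:Y, ~ replace(., lengths(.) == 0, NA))) |>
      # expand the list columns into one row per match
      unnest_longer(X:Y)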
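
The question asks for every possible combination. As far as I understand tidyr, unnesting several columns in one call recycles the elements within a row to a common size, which works for the example data but would not cross two X matches with two Y matches. A minimal sketch of one way to get the full cross product is to unnest the columns one after the other. The row `df2` below (ID 9) is made up for illustration and is not part of the original data:

```r
library(dplyr)
library(stringr)
library(tidyr)

listX <- c("ABC", "ACB", "BAC", "BCA", "CAB", "CBA")
listY <- c("XYZ", "XZY", "YXZ", "YZX", "ZXY", "ZYX")

# Hypothetical row containing two X terms and two Y terms
df2 <- data.frame(ID = 9, text = "sunt CBA, ABC in culpa XYZ qui ZYX anim")

combos <- df2 |>
  mutate(X = str_extract_all(text, str_flatten(listX, "|")),
         Y = str_extract_all(text, str_flatten(listY, "|")),
         across(X:Y, ~ replace(., lengths(.) == 0, NA))) |>
  unnest_longer(X) |>   # one row per X match
  unnest_longer(Y) |>   # then one row per Y match for each X
  select(ID, X, Y)

combos
```

Unnesting X first repeats the Y list for each X match, so the second unnest pairs every X with every Y. The final `select()` just drops the `text` column to match the layout asked for in the question.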
    

    Note: you might consider adding word boundaries (\\b) when building the regular expression, so that "ABC" does not match inside a longer token such as "ABCDE". That would look something like:

    str_c("\\b", listX, "\\b", collapse = "|")
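
To illustrate the difference, here is a small sketch comparing the plain alternation with the word-boundary version. The two test strings are made up for the example:

```r
library(stringr)

listX <- c("ABC", "ACB", "BAC", "BCA", "CAB", "CBA")

plain   <- str_flatten(listX, "|")
bounded <- str_c("\\b", listX, "\\b", collapse = "|")

texts <- c("foo ABC bar", "foo ABCDE bar")

# Without boundaries, "ABC" is also found inside "ABCDE"
res_plain <- str_extract_all(texts, plain)

# With boundaries, only the standalone "ABC" is extracted
res_bounded <- str_extract_all(texts, bounded)
```

Here `res_plain` contains a match for both strings, while `res_bounded[[2]]` is empty because "ABCDE" is a single word.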
    

    Edit

    When str_extract_all does not find a match, it returns a zero-length (empty) vector for that element:

    x <- str_extract_all(c("This is a test.", "Another test ABC."), "ABC")
    # [[1]]
    # character(0)
    # 
    # [[2]]
    # [1] "ABC"
    

    When you combine an empty vector with other vectors, the empty element is simply dropped:

    unlist(x)
    # [1] "ABC"
    

    So I added the across(...) line to replace these empty elements with NA before the unnest step; otherwise IDs without a match would be silently dropped from the result.
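
The effect of that replacement can be seen in isolation with base R. This sketch builds the list by hand to mirror the str_extract_all result shown above:

```r
# Same shape as the str_extract_all() result: no match, then one match
x <- list(character(0), "ABC")

# Flag the empty elements and swap them for NA
is_empty <- lengths(x) == 0
x_filled <- replace(x, is_empty, NA)

# The NA now holds the place of the element that had no match
flat <- unlist(x_filled)
flat
# [1] NA    "ABC"
```

Because every element now has length one, the first position survives the flattening instead of disappearing.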