How to keep a specific set of words or phrases from a text column in R?


I have a data frame with a text column, and I would like to create another column containing only the specific words or phrases that occur in that text. Let's say I have these 4 rows in the data frame:

   TEXT_COLUMN
1 "discovering the hidden themes in the collection."
2 "classifying the documents into the discovered themes."
3 "using the classification to organize/summarize/search the documents."
4 "alternatively, we can set a threshold on the score"

I also have a vector of words and phrases I want to keep. For example:

x <- c("hidden themes", "the documents", "discovered themes", "classification to organize", "search")

I would like to create a new column, "KEYWORDS", containing the entries of "x" that appear in the text column, separated by commas:

   TEXT_COLUMN                                                             |  KEYWORDS
1 "discovering the hidden themes in the collection."                       |  "hidden themes"
2 "classifying the documents into the discovered themes."                  |  "the documents", "discovered themes"
3 "using the classification to organize/summarize/search the documents."   |  "classification to organize", "search"
4 "alternatively, we can set a threshold on the score"                     |  NA

Do you know any way to do this?

Thank you very much in advance.


Solution

  • One option is to build a single regex pattern from 'x' by joining its elements with str_c:

    library(stringr)
    library(dplyr)
    # Join the keywords with "|" (regex OR) and wrap them in word boundaries
    pat <- str_c("\\b(", str_c(x, collapse = "|"), ")\\b")
    

    Then use this pattern to extract every matching substring from 'TEXT_COLUMN'. str_extract_all returns a list, so 'KEYWORDS' becomes a list column of character vectors:

    df1 <- df1 %>% 
          mutate(KEYWORDS = str_extract_all(TEXT_COLUMN, pat))
    

    -output

    df1
    #                                                           TEXT_COLUMN                                          KEYWORDS
    #1                     discovering the hidden themes in the collection.                                     hidden themes
    #2                classifying the documents into the discovered themes.                  the documents, discovered themes
    #3 using the classification to organize/summarize/search the documents. classification to organize, search, the documents
    #4                   alternatively, we can set a threshold on the score                                                  
    

    data

    df1 <- structure(list(TEXT_COLUMN = c("discovering the hidden themes in the collection.", 
    "classifying the documents into the discovered themes.", "using the classification to organize/summarize/search the documents.", 
    "alternatively, we can set a threshold on the score")), 
    class = "data.frame", row.names = c("1", 
    "2", "3", "4"))
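
The answer above leaves 'KEYWORDS' as a list column, and a row with no matches holds an empty character vector rather than NA. The question asked for a single comma-separated string per row with NA when nothing matches, so here is a minimal sketch of one way to get there. The vapply step is an addition for illustration, not part of the original answer; the data and pattern are the same as above. Note that row 3 also picks up "the documents", since that phrase is in 'x' and occurs in the text.

```r
library(stringr)

# Keywords and sample data copied from the question
x <- c("hidden themes", "the documents", "discovered themes",
       "classification to organize", "search")
df1 <- data.frame(
  TEXT_COLUMN = c(
    "discovering the hidden themes in the collection.",
    "classifying the documents into the discovered themes.",
    "using the classification to organize/summarize/search the documents.",
    "alternatively, we can set a threshold on the score"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer
pat <- str_c("\\b(", str_c(x, collapse = "|"), ")\\b")

# One character vector of matches per row
matches <- str_extract_all(df1$TEXT_COLUMN, pat)

# Collapse each vector into one string; rows without matches become NA
df1$KEYWORDS <- vapply(
  matches,
  function(m) if (length(m) == 0) NA_character_ else str_c(m, collapse = ", "),
  character(1)
)

print(df1$KEYWORDS)
```

If the keywords might contain regex metacharacters such as . or (, they would need to be escaped before being pasted into the pattern.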