I have a data frame with a text column, and I would like to add a second column containing only the specific words or phrases that appear in that text. Let's say I have these 4 rows in the data frame:
TEXT_COLUMN
1 "discovering the hidden themes in the collection."
2 "classifying the documents into the discovered themes."
3 "using the classification to organize/summarize/search the documents."
4 "alternatively, we can set a threshold on the score"
And, on the other hand, I have a list of words and phrases I want to keep. For example:
x <- c("hidden themes", "the documents", "discovered themes", "classification to organize", "search")
So, I would like to create a new column "KEYWORDS" containing the entries of "x" that occur in each row of the text column, separated by commas:
TEXT_COLUMN | KEYWORDS
1 "discovering the hidden themes in the collection." | "hidden themes"
2 "classifying the documents into the discovered themes." | "the documents", "discovered themes"
3 "using the classification to organize/summarize/search the documents." | "classification to organize", "search"
4 "alternatively, we can set a threshold on the score" | NA
Do you know any way to do this?
Thank you very much in advance.
One option is to build a single regex pattern from 'x' by joining its elements with "|" using str_c and wrapping the result in word boundaries (\\b)
library(stringr)
library(dplyr)
pat <- str_c("\\b(", str_c(x, collapse="|"), ")\\b")
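To make the pattern concrete, here is a small sketch that prints what 'pat' evaluates to and checks the effect of the word boundaries. The two test strings ("a search box", "researching") are made up for illustration and are not part of the question's data.

```r
library(stringr)

x <- c("hidden themes", "the documents", "discovered themes",
       "classification to organize", "search")

# Same construction as above: alternatives joined with "|" inside \\b ... \\b
pat <- str_c("\\b(", str_c(x, collapse = "|"), ")\\b")
pat
# "\\b(hidden themes|the documents|discovered themes|classification to organize|search)\\b"

# The word boundaries keep "search" from matching inside a longer word
str_detect("a search box", pat)  # TRUE
str_detect("researching", pat)   # FALSE
```

Note that the entries of 'x' are inserted into the regex as-is, so this assumes they contain no regex metacharacters (such as "." or "("); if they did, they would need escaping first.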
Then, using this pattern, extract the matching substrings from 'TEXT_COLUMN' into a list column of character vectors
df1 <- df1 %>%
  mutate(KEYWORDS = str_extract_all(TEXT_COLUMN, pat))
-output
df1
#  TEXT_COLUMN                                                           KEYWORDS
#1 discovering the hidden themes in the collection.                      hidden themes
#2 classifying the documents into the discovered themes.                 the documents, discovered themes
#3 using the classification to organize/summarize/search the documents.  classification to organize, search, the documents
#4 alternatively, we can set a threshold on the score
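The question asked for a single comma-separated string per row, with NA where nothing matches, whereas str_extract_all returns a list column. A minimal sketch of one way to collapse it, assuming the same 'x' and data as in this answer ('df2' is a name introduced here only for illustration):

```r
library(stringr)
library(dplyr)

df1 <- data.frame(
  TEXT_COLUMN = c(
    "discovering the hidden themes in the collection.",
    "classifying the documents into the discovered themes.",
    "using the classification to organize/summarize/search the documents.",
    "alternatively, we can set a threshold on the score"
  ),
  stringsAsFactors = FALSE
)
x <- c("hidden themes", "the documents", "discovered themes",
       "classification to organize", "search")
pat <- str_c("\\b(", str_c(x, collapse = "|"), ")\\b")

# Collapse each row's matches into one string; rows with no match become NA
df2 <- df1 %>%
  mutate(KEYWORDS = vapply(
    str_extract_all(TEXT_COLUMN, pat),
    function(m) if (length(m) == 0) NA_character_ else str_c(m, collapse = ", "),
    character(1)
  ))

df2$KEYWORDS
# "hidden themes"
# "the documents, discovered themes"
# "classification to organize, search, the documents"
# NA
```

Row 3 also picks up "the documents", because that phrase is in 'x' and does occur in the text; drop it from 'x' if that match is not wanted.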
-data
df1 <- structure(list(TEXT_COLUMN = c(
"discovering the hidden themes in the collection.",
"classifying the documents into the discovered themes.",
"using the classification to organize/summarize/search the documents.",
"alternatively, we can set a threshold on the score")),
class = "data.frame", row.names = c("1", "2", "3", "4"))