Suppose I have a data frame with two columns, "question_no" and "question_text".
"question_no" simply runs from 1 to length(data$question_no),
and "question_text" contains the questions.
I want to categorize the questions that contain the phrase "in order" or the word "summarize".
So far I've come up with these few lines of code:
library(tm)
questions <- Corpus(VectorSource(data$question_text))
questions <- tm_map(questions, content_transformer(tolower))
questions <- tm_map(questions, stripWhitespace)
spesificQuestion<- ifelse(Corpus=="in order"|Corpus=="summarize",pquestions, others=
I know this code is pretty awful; I just wanted to show my intention.
What should I do to select certain words from a corpus?
With this data frame:
df <- data.frame(
question_no = c(1:6),
question_text = c("put these words in order","summarize the paper","nonsense",
"summarize the story", "put something in order", "nonsense")
)
question_no  question_text
1            put these words in order
2            summarize the paper
3            nonsense
4            summarize the story
5            put something in order
6            nonsense
You could try...
library(stringr)
library(dplyr)
mutate(df, condition_met = if_else(str_detect(question_text, "\\bsummarize\\b|\\bin order\\b"), "Yes", "No"))
Which produces...
question_no  question_text             condition_met
1            put these words in order  Yes
2            summarize the paper       Yes
3            nonsense                  No
4            summarize the story       Yes
5            put something in order    Yes
6            nonsense                  No
stringr::str_detect returns a logical vector the same length as its first argument: it checks each element of the original vector for your desired string (or strings). Note that I wrap "summarize" and "in order" in word boundaries (\\b) so that they only match as whole words, not inside longer words such as "summarized" or "unsummarize". If that doesn't matter to you, you can drop the boundaries and use "summarize|in order"; the ".*" in a pattern like ".*summarize.*|.*in order.*" is unnecessary because str_detect matches anywhere in the string.
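To make the effect of the word boundaries concrete, here is a small sketch comparing the two patterns. The test strings are made up for illustration and are not part of the original data. Note that \\b treats a hyphen as a boundary, so a hyphenated form such as "un-summarize" still counts as containing the whole word.

```r
library(stringr)

# Hypothetical test strings (not from the original data frame)
x <- c("summarize the paper", "summarized already", "unsummarize", "un-summarize")

# With word boundaries: "summarize" must appear as a whole word
with_boundary <- str_detect(x, "\\bsummarize\\b")

# Without boundaries: any occurrence of the substring counts
without_boundary <- str_detect(x, "summarize")

print(data.frame(x, with_boundary, without_boundary))
```

Which form is right depends on whether inflected forms such as "summarized" should count as a match for your categories.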
Using if_else allows you to turn the TRUE and FALSE values into whatever labels you want. In this case I used "Yes" and "No".
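Since the goal in the question was to categorize the questions, more than two labels may be useful. The following is only a sketch, assuming dplyr::case_when is acceptable here; the column name `category` and the label values are hypothetical choices, not something from the original post.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  question_no = 1:6,
  question_text = c("put these words in order", "summarize the paper", "nonsense",
                    "summarize the story", "put something in order", "nonsense")
)

# Conditions are checked top to bottom; the first one that is TRUE wins
categorized <- mutate(
  df,
  category = case_when(
    str_detect(question_text, "\\bin order\\b")  ~ "in order",
    str_detect(question_text, "\\bsummarize\\b") ~ "summarize",
    TRUE                                         ~ "other"
  )
)

print(categorized)
```

Because the first matching condition wins, a question containing both phrases would be labelled "in order" in this sketch; reorder the conditions if you want the opposite.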
dplyr::mutate creates a new column with whatever name you choose. Leaving the values as TRUE and FALSE lets you see how many, or what proportion, of the entries contain the strings you're interested in. If that's what you want, take out the if_else call, i.e.:
mutate(df, condition_met = str_detect(question_text, "\\bsummarize\\b|\\bin order\\b"))
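As a rough illustration of why the logical values are handy: TRUE counts as 1 and FALSE as 0, so sum() gives the number of matching questions and mean() gives the proportion. The variable names below are just for this sketch.

```r
library(stringr)

df <- data.frame(
  question_no = 1:6,
  question_text = c("put these words in order", "summarize the paper", "nonsense",
                    "summarize the story", "put something in order", "nonsense")
)

# Logical vector: TRUE where the question contains either phrase
matches <- str_detect(df$question_text, "\\bsummarize\\b|\\bin order\\b")

n_matching <- sum(matches)      # number of matching questions
prop_matching <- mean(matches)  # proportion of matching questions

print(n_matching)
print(prop_matching)
```

With the example data frame above, this gives 4 matching questions out of 6.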