r tm corpus

How to search for words in a Corpus?


Suppose I have a data frame with two columns, "question_no" and "question_text". "question_no" just runs from 1 to length(data$question_no), and "question_text" holds the questions. I want to categorize the questions that contain the words "in order" or "summarize". So far I've come up with these few lines of code:

library(tm)
questions <- Corpus(VectorSource(data$question_text))
questions <- tm_map(questions, content_transformer(tolower))
questions <- tm_map(questions, stripWhitespace)
spesificQuestion<- ifelse(Corpus=="in order"|Corpus=="summarize",pquestions, others=

I know this code is pretty awful; I just wanted to show my intention.

What should I do to select certain words from a corpus?


Solution

  • With this data frame:

       df <- data.frame(
         question_no = 1:6,
         question_text = c("put these words in order", "summarize the paper", "nonsense",
                           "summarize the story", "put something in order", "nonsense")
       )
    
        question_no            question_text
           1             put these words in order
           2             summarize the paper
           3             nonsense
           4             summarize the story
           5             put something in order
           6             nonsense
    

    You could try...

         library(stringr)
         library(dplyr)
         # Inside mutate(), columns can be referenced by name, so df$ is not needed
         mutate(df, condition_met = if_else(str_detect(question_text, "\\bsummarize\\b|\\bin order\\b"), "Yes", "No"))
    

    Which produces...

      question_no            question_text         condition_met
           1         put these words in order           Yes
           2         summarize the paper                Yes
           3         nonsense                           No
           4         summarize the story                Yes
           5         put something in order             Yes
           6         nonsense                           No
    

    stringr::str_detect returns a logical vector the same length as its first argument: for each element of the input vector, it reports whether that element contains a match for the pattern. Note that I wrap "summarize" and "in order" in word boundaries (\\b) so that they match only as whole words and not inside longer words such as "summarized" or "unsummarize". If that doesn't matter to you, you can simplify the pattern to "summarize|in order", since str_detect already matches anywhere in the string. Using if_else lets you turn the TRUE and FALSE values into whatever labels you want; in this case I used "Yes" and "No".
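
    To make the difference between the whole-word pattern and a plain substring pattern concrete, here is a small sketch. The test strings in x are made up for illustration and are not part of the original question.

```r
library(stringr)

# Hypothetical test strings, chosen to show where the two patterns differ
x <- c("summarize the paper", "summarized notes", "unsummarize this",
       "put it in order", "nonsense")

# Whole-word pattern from the answer: \\b requires a word boundary on both sides
whole_word <- str_detect(x, "\\bsummarize\\b|\\bin order\\b")

# Plain substring pattern: matches the text anywhere, even inside longer words
substring_match <- str_detect(x, "summarize|in order")

# Side-by-side comparison of the two logical vectors
print(data.frame(x, whole_word, substring_match))
```

    Only the second and third strings differ between the two columns: they contain "summarize" inside a longer word, so the substring pattern matches them and the whole-word pattern does not.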

    dplyr::mutate adds a new column with whatever name you choose. Leaving the values as TRUE and FALSE makes it easy to see how many, or what proportion of, entries contain the strings you're interested in. If that's what you want, drop the if_else() wrapper, i.e.:

         mutate(df, condition_met = str_detect(question_text, "\\bsummarize\\b|\\bin order\\b"))
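
    As a rough sketch of the "how many or what proportion" idea, the logical vector can be summed or averaged directly, because R treats TRUE as 1 and FALSE as 0. The example rebuilds the same six-row data frame so it runs on its own; the variable name matched is my own choice, not something from the original answer.

```r
library(stringr)

# Same example data as in the answer
df <- data.frame(
  question_no = 1:6,
  question_text = c("put these words in order", "summarize the paper", "nonsense",
                    "summarize the story", "put something in order", "nonsense")
)

# Logical vector: TRUE where the question contains either phrase as whole words
matched <- str_detect(df$question_text, "\\bsummarize\\b|\\bin order\\b")

n_matched    <- sum(matched)   # number of matching questions
prop_matched <- mean(matched)  # proportion of matching questions

# Keep only the rows that matched
matched_rows <- df[matched, ]
print(matched_rows)
```

    With this data, questions 1, 2, 4 and 5 match, so the count is 4 out of 6.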