Removing stopwords from R data frame column


Here's a problem whose solution seemed simple at first but has turned out to be more complicated than I expected.

I have an R data frame with three columns: an ID, a column with texts (reviews), and one with numeric values which I want to predict based on the text.

I have already done some preprocessing on the text column, so it is free of punctuation, in lower case, and ready to be tokenized and turned into a matrix so I can train a model on it. The problem is I can't figure out how to remove the stop words from that text.

Here's what I am trying to do with the text2vec package. I originally planned to remove the stop words before this chunk, but anywhere in the pipeline will do.

library(text2vec)

test_data <- data.frame(review_id = c(1, 2, 3),
                        review = c('is a masterpiece a work of art',
                                   'sporting some of the best writing and voice work',
                                   'better in every possible way when compared'),
                        score = c(90, 100, 100))

tokens <- word_tokenizer(test_data$review)
document_term_matrix <- create_dtm(itoken(tokens), hash_vectorizer())
model_tfidf <- TfIdf$new()
document_term_matrix <- model_tfidf$fit_transform(document_term_matrix)

document_term_matrix <- as.matrix(document_term_matrix)

I am hoping to get the review column to be something like:

review=c('masterpiece work art',
         'sporting best writing voice work',
         'better possible way compared')

Solution

  • You can use the tidytext package for this:

    library(tidytext)
    library(dplyr)
    
    test_data %>%
      unnest_tokens(review, review) %>%
      anti_join(stop_words, by= c("review" = "word"))
    
    #    review_id      review score
    #1.2         1 masterpiece    90
    #1.6         1         art    90
    #2           2    sporting   100
    #2.5         2     writing   100
    #2.7         2       voice   100
    #3.6         3    compared   100
    

    To get the words back into one row per review, you could do:

    test_data %>%
      unnest_tokens(review, review) %>%
      anti_join(stop_words, by= c("review" = "word")) %>%
      group_by(review_id, score) %>%
      summarise(review = paste0(review, collapse = ' '))
    
    #  review_id score review                
    #      <dbl> <dbl> <chr>                 
    #1         1    90 masterpiece art       
    #2         2   100 sporting writing voice
    #3         3   100 compared
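
Note that the tidytext output above also drops words such as "work", "best" and "possible", because the combined `stop_words` table is fairly broad. If a narrower list is wanted, one option is to filter the text against a custom vector in base R. The following is only a minimal sketch: `my_stopwords` and `remove_stopwords()` are hypothetical names introduced here, and the stop-word list is hand-picked for this toy data, not a real lexicon.

```r
# Hypothetical, hand-picked stop-word list for illustration only;
# a real list would be much longer.
my_stopwords <- c("is", "a", "of", "some", "the", "and", "in", "every", "when")

test_data <- data.frame(
  review_id = c(1, 2, 3),
  review = c("is a masterpiece a work of art",
             "sporting some of the best writing and voice work",
             "better in every possible way when compared"),
  score = c(90, 100, 100),
  stringsAsFactors = FALSE
)

# Split each review on whitespace, drop the stop words,
# and paste the remaining words back into a single string.
remove_stopwords <- function(text, stopwords) {
  vapply(strsplit(text, "\\s+"), function(words) {
    paste(words[!words %in% stopwords], collapse = " ")
  }, character(1))
}

test_data$review <- remove_stopwords(test_data$review, my_stopwords)
test_data$review
# [1] "masterpiece work art"
# [2] "sporting best writing voice work"
# [3] "better possible way compared"
```

This keeps one row per review, so the column can go straight into `word_tokenizer()`. To reuse one of the tidytext lexicons instead of a hand-written vector, something like `stop_words$word[stop_words$lexicon == "snowball"]` could be passed as the `stopwords` argument; which words that particular lexicon removes would need to be checked against your data.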