dplyr, text-mining, stop-words, tidytext

Manually inserting topic-specific stopwords


I'm using anti_join(get_stopwords()) with tidytext to clean documents from a dataset of customer reviews of tech products, but the resulting corpus consists primarily of tech specifications (e.g., Windows 10, 720p Camera, 380.6 x 258.2 x 22.45 (inches), IntelCore, etc.) and contains few of the adjectives and nouns that indicate a customer's satisfaction with a product.

Is there a handy way to compile a list of tech terms to remove (such as those listed above) and add it to get_stopwords() or an equivalent function, so that the non-tech adjectives and nouns in customer reviews are easier to identify?


Solution

  • You can create a data frame of your own stop words and remove them with anti_join(), the same way you remove the built-in list. This example uses a novel by H. G. Wells and two user-specified stop words (thanks to https://www.tidytextmining.com/tidytext.html). I don't know of a reputable corpus of tech-related stop words.

    library(dplyr)
    library(tidytext)
    library(gutenbergr)
    hgwells <- gutenberg_download(35) # The Time Machine
    my_stop_words <- data.frame(word = c('time', 'machine')) # list of your stop words
    hgwells %>%
      unnest_tokens(word, text) %>%
      anti_join(my_stop_words, by = "word") # removes words 'time' and 'machine'
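
For the review data in the question, the same idea can be combined with the general-purpose list, so both are removed in one pass. What follows is only a rough sketch: the `reviews` tibble and the `tech_terms` list are made up for illustration (you would compile your own terms by hand), and the digit filter is one possible way to drop tokens such as "720p" or "380.6" without listing each of them.

```r
library(dplyr)
library(tidytext)

# Hypothetical reviews standing in for the real data
reviews <- tibble(
  id = 1:2,
  text = c("Great laptop, the 720p camera is disappointing",
           "Windows 10 runs fast and the screen is gorgeous")
)

# Hypothetical hand-compiled tech terms (lower case, because
# unnest_tokens() lower-cases tokens by default)
tech_terms <- tibble(
  word = c("windows", "camera", "intelcore", "laptop", "screen"),
  lexicon = "custom"
)

# Stack the custom terms on top of the general stop word list
all_stop_words <- bind_rows(get_stopwords(), tech_terms)

cleaned <- reviews %>%
  unnest_tokens(word, text) %>%
  anti_join(all_stop_words, by = "word") %>%  # drop general + tech stop words
  filter(!grepl("[0-9]", word))               # drop tokens containing digits

cleaned
```

Keeping a `lexicon` column on the custom list makes it easy to see later which entries came from get_stopwords() and which were added by hand.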