I'm using tidytext's built-in anti_join(get_stopwords()) command to clean documents from a dataset of customer reviews of tech products, but I found that the resulting corpus consists primarily of tech specifications (e.g., Windows 10, 720p Camera, 380.6 x 258.2 x 22.45 (inches), IntelCore, etc.) and contains few adjectives and nouns indicative of a customer's satisfaction with a product.
Is there a handy way to compile a list of tech terms to remove (such as those listed above) and manually add it to get_stopwords() or an equivalent function, to better identify the non-tech adjectives and nouns in customer reviews?
You can create a data frame of your own stop words and pass it to anti_join() in place of, or in addition to, the built-in list. This example uses a novel by H. G. Wells and two user-specified stop words (thanks to https://www.tidytextmining.com/tidytext.html). I don't know of a reputable corpus of tech-related stop words.
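library(dplyr)
library(tidytext)
library(gutenbergr)
hgwells <- gutenberg_download(35) # Project Gutenberg ID 35, "The Time Machine"
my_stop_words <- data.frame(word = c('time', 'machine')) # list of your stop words
hgwells %>% unnest_tokens(word, text) %>%
  anti_join(my_stop_words, by = "word") # removes the words 'time' and 'machine'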
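To keep the default stop words and add your own tech terms on top, one option is to bind the two lists together before the anti_join(). Below is a minimal sketch of that idea, not a definitive recipe. The reviews data frame and the tech_terms list are made up for illustration, and the digit filter is my own assumption about how you might want to treat tokens such as 720p or 380.6. It also assumes the stopwords package is installed, since get_stopwords() relies on it.

```r
library(dplyr)
library(tidytext)
library(stringr)

# Hypothetical reviews, made up for illustration only
reviews <- tibble::tibble(
  id = 1:2,
  text = c("Great laptop, but the 720p camera on Windows 10 is disappointing",
           "The IntelCore machine is fast and the screen is gorgeous")
)

# Hypothetical tech terms to drop. Keep them lowercase, because
# unnest_tokens() lowercases tokens by default.
tech_terms <- tibble::tibble(
  word = c("windows", "camera", "intelcore"),
  lexicon = "custom"
)

# Append the custom terms to the default stop word list
all_stop_words <- bind_rows(get_stopwords(), tech_terms)

cleaned <- reviews %>%
  unnest_tokens(word, text) %>%
  anti_join(all_stop_words, by = "word") %>%
  filter(!str_detect(word, "[0-9]")) # drop any token containing a digit

cleaned
```

A plain word list will not catch every model number or dimension, which is why the sketch also filters on digits. Check that this does not remove tokens you care about before relying on it.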