Tags: r, tm, corpus, stop-words

Prevent tm from removing stopwords from double words


I'm trying to remove stopwords from a character vector. The problem is that one of the entries contains the phrase "king kong", and because "king" is one of my stopwords, the "king" in "king kong" gets removed.

Is there a way to keep two-word terms like this from being broken up? My code is:

library(tm)

# newmnt1$form is chr [1:4] "king kong lives" "foot" "island" "skull"
text <- VCorpus(VectorSource(newmnt1$form))

# Normal standardization of text.
# custom_stopwords is a character vector of stopwords that includes "king".
text <- tm_map(text, content_transformer(tolower))
text <- tm_map(text, removeWords, custom_stopwords)
text <- tm_map(text, stripWhitespace)
newmnt2 <- text[[1]]$content

Solution

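  • One quick hack would be to convert your "king kong" patterns to "king_kong" before removing stopwords. removeWords matches whole words only, and an underscore counts as a word character, so "king_kong" is no longer matched by the stopword "king".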

    a <- gsub("king kong", "king_kong", "This is a pattern with king and king kong")
    a
    [1] "This is a pattern with king and king_kong"
    
    tm::removeWords(a, "king")
    [1] "This is a pattern with  and king_kong"
    

    Best,

    Colin
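
The same hack can be dropped into the tm pipeline from the question. The following is only a sketch under stated assumptions: `protect_phrases` and `restore_phrases` are hypothetical helpers (not part of tm), `phrases` is a hand-maintained list of terms to keep intact, and `docs` and `custom_stopwords` stand in for the asker's `newmnt1$form` and stopword list.

```r
library(tm)

# Assumed inputs: the question's example vector and a stopword list
# that contains "king". Replace with your own data.
docs <- c("king kong lives", "foot", "island", "skull")
custom_stopwords <- c("king", "the")

# Phrases that should survive stopword removal (assumption: maintained by hand).
phrases <- c("king kong")

# Hypothetical helper: replace the space inside each phrase with "_".
protect_phrases <- function(x, phrases) {
  for (p in phrases) {
    x <- gsub(p, gsub(" ", "_", p, fixed = TRUE), x, fixed = TRUE)
  }
  x
}

# Hypothetical helper: turn the protected form back into the original phrase.
restore_phrases <- function(x, phrases) {
  for (p in phrases) {
    x <- gsub(gsub(" ", "_", p, fixed = TRUE), p, x, fixed = TRUE)
  }
  x
}

text <- VCorpus(VectorSource(docs))
text <- tm_map(text, content_transformer(tolower))
text <- tm_map(text, content_transformer(protect_phrases), phrases)   # before removeWords
text <- tm_map(text, removeWords, custom_stopwords)
text <- tm_map(text, content_transformer(restore_phrases), phrases)   # after removeWords
text <- tm_map(text, stripWhitespace)

result <- vapply(text, content, character(1))
result[[1]]
# expected: "king kong lives"
```

Note that the helpers use fixed (literal) matching, so they do not check word boundaries themselves; a phrase embedded in a longer word would also be protected. If the underscore form is acceptable downstream, the restore step can be skipped.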