R remove stopwords from a character vector using %in%


I have a data frame with a column of strings that I'd like to remove stop words from. The data set is large and tm seems to run a bit slowly, so I'm trying to avoid tm's own text-processing functions; I am only using tm's stop word dictionary.

library(tm)  # used only for its stop word list

stopWords <- stopwords("en")
class(stopWords)

df1 <- data.frame(id = seq(1,5,1), string1 = NA)
head(df1)
df1$string1[1] <- "This string is a string."
df1$string1[2] <- "This string is a slightly longer string."
df1$string1[3] <- "This string is an even longer string."
df1$string1[4] <- "This string is a slightly shorter string."
df1$string1[5] <- "This string is the longest string of all the other strings."

head(df1)
df1$string1 <- tolower(df1$string1)
str1 <- strsplit(df1$string1[5], " ")

> !(str1 %in% stopWords)
[1] TRUE

This is not the result I'm looking for. I'm trying to get a vector (or a string) of the words that are NOT in the stopWords vector.

What am I doing wrong?


Solution

  • strsplit() returns a list, so str1 %in% stopWords tests the single list element as a whole rather than each word. In addition, %in% only gives you a logical vector of TRUE/FALSE; you still have to use that vector to subset the words. You should do something like this:

    unlist(str1)[!(unlist(str1) %in% stopWords)]
    

    (or)

    str1[[1]][!(str1[[1]] %in% stopWords)]
    

    For the whole data.frame df1, you could do something like:

    `%nin%` <- Negate(`%in%`)
    lapply(df1$string1, function(x) {
        words <- unlist(strsplit(x, " "))  # split one string into words
        words[words %nin% stopWords]       # keep the words that are not stop words
    })
    
    # [[1]]
    # [1] "string"  "string."
    # 
    # [[2]]
    # [1] "string"   "slightly" "string." 
    # 
    # [[3]]
    # [1] "string"  "string."
    # 
    # [[4]]
    # [1] "string"   "slightly" "shorter"  "string." 
    # 
    # [[5]]
    # [1] "string"   "string"   "strings."
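
If you want strings back rather than a list of word vectors, the same idea can be wrapped in a small function. The following is only a sketch under a few assumptions: `remove_stopwords` is a hypothetical helper name, the short `stopWords` vector is a stand-in so the example runs without tm (with tm loaded you would use `stopwords("en")`), and punctuation is stripped first so that "string." and "string" are treated as the same word, which the code above does not do.

```r
# Stand-in stop word list (an assumption for illustration); with tm loaded
# you would use stopWords <- stopwords("en") instead.
stopWords <- c("this", "is", "a", "an", "the", "of", "all", "other")

strings <- c("This string is a string.",
             "This string is the longest string of all the other strings.")

# strsplit() returns a list with one character vector per input string,
# which is why the words have to be pulled out of the list before using %in%.
tokens <- strsplit(tolower(strings), " ")
is.list(tokens)      # TRUE
length(tokens[[2]])  # 11 words in the second string

# Hypothetical helper: lower-case, drop punctuation, split into words,
# remove the stop words, and paste the remaining words back together.
remove_stopwords <- function(x, stops) {
  words <- strsplit(gsub("[[:punct:]]", "", tolower(x)), " ", fixed = TRUE)
  vapply(words,
         function(w) paste(w[!(w %in% stops)], collapse = " "),
         character(1))
}

cleaned <- remove_stopwords(strings, stopWords)
cleaned
# [1] "string string"                 "string longest string strings"
```

Because the helper returns one string per input string, the result could be assigned straight back to the column, for example `df1$string1 <- remove_stopwords(df1$string1, stopWords)`.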