
R - Delete stop words in a data frame


I am working on text analytics and need to count sentences. My code is:

library(dplyr)
library(tidytext)
# Read the file and transliterate accented characters to ASCII
txt <- readLines("consolidado.txt", encoding = "UTF-8")
txt <- iconv(txt, to = "ASCII//TRANSLIT")
# One row per line of text; seq_along() avoids hard-coding the 392 lines
text_df <- tibble(line = seq_along(txt), text = txt)
# Tokenize into n-grams of 1 to 7 words
palabras1 <- text_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 1)
palabras2 <- text_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
palabras3 <- text_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 3)
palabras4 <- text_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 4)
palabras5 <- text_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 5)
palabras6 <- text_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 6)
palabras7 <- text_df %>% unnest_tokens(bigram, text, token = "ngrams", n = 7)

First I convert the text into a data frame and then I work with tidytext. This works well, but the problem is the stop words. I want to delete the stop words from the data frame, but I don't know how. I tried converting it to a corpus, but that doesn't work: although it removes the stop words, I can't count the sentences afterwards.

Is there some way to delete the stop words in a data frame?

Thank you.


Solution

  • I tried anti_join(), but I got this error:

    by required, because the data sources have no common variables
    

    After googling the problem, I tried:

    by = NULL
    by = c("a" = "b")
    by = c(namecolumn = namecolumn)
    

    and many other variations of "by", but I couldn't get it to work.
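
    For reference, this error usually means that the two tables share no column name, so anti_join() cannot guess the key. In the question the token column is named bigram, while stop word tables normally use a column named word; by = c("a" = "b") fails because neither table has columns with those names. A minimal sketch of naming the key explicitly, assuming a tiny hand-written stop word list (stop_es and the sample sentences are made up for illustration, not a real lexicon):

    ```r
    library(dplyr)
    library(tidytext)

    # Illustrative input; the original reads "consolidado.txt" instead
    text_df <- tibble(line = 1:2,
                      text = c("el gato come pescado", "la casa de la playa"))

    # Hypothetical mini stop word list, for illustration only
    stop_es <- tibble(word = c("el", "la", "de"))

    # One token per row; the token column is called "bigram" as in the question
    palabras1 <- text_df %>%
      unnest_tokens(bigram, text, token = "ngrams", n = 1)

    # Name the key explicitly: column "bigram" on the left, "word" on the right
    sin_stop <- palabras1 %>%
      anti_join(stop_es, by = c("bigram" = "word"))

    sin_stop$bigram
    ```

    This matches whole tokens only, so it suits single words (n = 1); a multi-word n-gram never equals a single stop word.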

    Finally, I solved it like this:

    library(tm)
    library(dplyr)
    library(tidytext)
    
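    # Read the file and transliterate accented characters to ASCII
    txt <- readLines("consolidado.txt", encoding = "UTF-8")
    txt <- iconv(txt, to = "ASCII//TRANSLIT")
    # One row per line of text; seq_along() avoids hard-coding the 392 lines
    text_df <- tibble(line = seq_along(txt), text = txt)
    
    # Remove the Spanish stop words, then collapse the leftover whitespace
    text_df$text <- removeWords(text_df$text, stopwords("spanish"))
    text_df$text <- stripWhitespace(text_df$text)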
    

    The tm package includes a Spanish stop word list.

    I select the column of my data frame that holds the text (here it is called text). Then I use removeWords() to remove the stop words. The last line uses stripWhitespace() to collapse the extra spaces left behind once the stop words are gone.
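
    As a rough sketch of how the cleaned column can then feed the n-gram counting from the question, the snippet below runs the same steps on two made-up sentences and counts bigrams. The tolower() call is an added assumption, not part of the solution above: removeWords() matches case-sensitively and tm's Spanish list is lowercase, so capitalized stop words such as "El" would otherwise survive.

    ```r
    library(tm)
    library(dplyr)
    library(tidytext)

    # Illustrative input; the original reads "consolidado.txt" instead
    txt <- c("El gato come pescado", "La casa de la playa")
    text_df <- tibble(line = seq_along(txt), text = txt)

    # Added step (assumption): lowercase so that "El" and "La" are matched too
    text_df$text <- tolower(text_df$text)
    text_df$text <- removeWords(text_df$text, stopwords("spanish"))
    text_df$text <- stripWhitespace(text_df$text)

    # Tokenize the cleaned text into bigrams and count them
    bigramas <- text_df %>%
      unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
      count(bigram, sort = TRUE)
    ```

    Because the stop words are removed before tokenizing, words that were separated by stop words become neighbours: "casa de la playa" yields the bigram "casa playa". Whether that is desirable depends on the analysis.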

    Thanks for the help.