Removing non-English text from Corpus in R using tm()


I am using the tm and wordcloud packages for some basic text mining in R, but I am running into difficulties because my dataset contains non-English characters (even though I have tried to filter out other languages based on background variables).

Let's say that some of the lines in my TXT file (saved as UTF-8 in TextWrangler) look like this:

Special
satisfação
Happy
Sad
Potential für

I then read my txt file into R:

words <- Corpus(DirSource("~/temp", encoding = "UTF-8"),readerControl = list(language = "lat"))

This yields the warning message:

Warning message:
In readLines(y, encoding = x$Encoding) :
  incomplete final line found on '/temp/file.txt'

But since it's a warning, not an error, I continue to push forward.

words <- tm_map(words, stripWhitespace)
words <- tm_map(words, tolower)

This then yields the error:

Error in FUN(X[[1L]], ...) : invalid input 'satisfa��o' in 'utf8towcs'

I'm open to finding ways to filter out the non-English characters either in TextWrangler or R; whatever is the most expedient. Thanks for your help!


Solution

  • Here's a method to remove words with non-ASCII characters before making a corpus:

    # remove words with non-ASCII characters
    # assuming you read your txt file in as a vector, e.g.
    # dat <- readLines('~/temp/dat.txt')
    dat <- "Special, satisfação, Happy, Sad, Potential, für"
    # convert string to vector of words
    dat2 <- unlist(strsplit(dat, split = ", "))
    # flag words with non-ASCII characters: iconv() replaces every byte it
    # cannot convert with the marker "dat2", which grepl() then detects
    dat3 <- grepl("dat2", iconv(dat2, "latin1", "ASCII", sub = "dat2"))
    # subset original vector of words to exclude words with non-ASCII chars
    # (a logical index also works when nothing is flagged, unlike dat2[-dat3])
    dat4 <- dat2[!dat3]
    # convert vector back to a string
    dat5 <- paste(dat4, collapse = ", ")
    # make corpus
    library(tm)
    words1 <- Corpus(VectorSource(dat5))
    inspect(words1)
    
    A corpus with 1 text document
    
    The metadata consists of 2 tag-value pairs and a data frame
    Available tags are:
      create_date creator 
    Available variables in the data frame are:
      MetaID 
    
    [[1]]
    Special, Happy, Sad, Potential
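
If the file has one word or phrase per line, as in the question, the same idea can be applied line by line instead of to a single comma-separated string. The following is only a sketch under a few assumptions: `keep_ascii_words` is a hypothetical helper name, the input is taken to be UTF-8, and words are taken to be separated by whitespace. It relies on `iconv()` returning `NA` for any element it cannot convert when no `sub` is given.

```r
# Sketch: drop every whitespace-separated token that contains a non-ASCII
# character, line by line. `keep_ascii_words` is a hypothetical helper name.
keep_ascii_words <- function(lines) {
  vapply(strsplit(lines, "[[:space:]]+"), function(tokens) {
    # iconv() returns NA for elements that cannot be converted to ASCII
    ok <- !is.na(iconv(tokens, from = "UTF-8", to = "ASCII"))
    paste(tokens[ok], collapse = " ")
  }, character(1))
}

# Sample lines from the question, written with \u escapes so the example
# does not depend on how this script file itself is encoded.
lines <- c("Special", "satisfa\u00e7\u00e3o", "Happy", "Sad", "Potential f\u00fcr")

cleaned <- keep_ascii_words(lines)
cleaned <- cleaned[nzchar(cleaned)]  # drop lines that ended up empty
print(cleaned)
```

With real data, `lines` would presumably come from something like `readLines(path, encoding = "UTF-8")`, and `cleaned` could then be passed to `Corpus(VectorSource(cleaned))` as above. Note that this drops the whole word, so legitimate English words containing accented characters would also be removed.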