I'm trying to learn R, and I've been stuck on this problem for hours. I've searched and tried lots of things to fix it, but no luck so far. So here we go: I'm downloading some random tweets from Twitter (via twitteR). When I check my data frame I can see all the special characters (such as üğıİşçÇöÖ). I then remove some things (whitespace etc.), and after all the cleaning and manipulation the corpus still looks fine. The character encoding problem starts when I try to create a TermDocumentMatrix: after that, both "tdm" and "df" contain weird symbols and may have lost some characters. Here is the code:
tweetsg.df <- twListToDF(tweets)
#looks good. no encoding problems.
wordCorpus <- Corpus(VectorSource(tweetsg.df$text))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
#wordCorpus looks fine at this point.
tdm <- TermDocumentMatrix(wordCorpus, control = list(tokenize="scan",
wordLengths = c(3, Inf),language="Turkish"))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 1)
df <- data.frame(term = names(term.freq), freq = term.freq)
At this point both tdm and df have weird symbols and missing characters.
Still no luck, though! Any kind of help or pointers are welcome :) PS: I'm a non-native English speaker AND an R newbie. Also, if we can solve this, I think I have a problem with emojis too. I would like to remove them or, even better, USE them :)
I've managed to reproduce your issue and, with a few changes, get correct Turkish output. Try changing the line
wordCorpus <- Corpus(VectorSource(tweetsg.df$text))
to
wordCorpus <- Corpus(DataframeSource(data.frame(tweetsg.df$text)))
and adding a line similar to this before the corpus is created:
Encoding(tweetsg.df$text) <- "UTF-8"
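To make it clearer what that line does: as far as I understand it, assigning to Encoding() only marks the bytes a string already has as UTF-8; it does not convert anything. A minimal base R sketch (the byte string is made up for illustration and is not taken from the tweets):

```r
# Hypothetical input: the UTF-8 bytes for "değiştirdik", written as raw
# escapes so the string starts out without a declared encoding.
x <- "de\xc4\x9fi\xc5\x9ftirdik"
before <- Encoding(x)   # typically "unknown" at this point

# Mark (not convert) the existing bytes as UTF-8.
Encoding(x) <- "UTF-8"
after <- Encoding(x)

# The bytes are untouched; only the declared encoding changed.
n_bytes  <- nchar(x, type = "bytes")
n_chars  <- nchar(x, type = "chars")
is_valid <- validUTF8(x)
```

If the text were actually in another encoding (say Latin-5), marking it as UTF-8 would not help; it would have to be converted, for example with iconv().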
The code I got to work was
library(tm)
sampleTurkish <- "değiştirdik değiştirdik değiştirdik"
Encoding(sampleTurkish) <- "UTF-8"
#looks good. no encoding problems.
wordCorpus <- Corpus(DataframeSource(data.frame(sampleTurkish)))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
#wordCorpus looks fine at this point.
tdm <- TermDocumentMatrix(wordCorpus)
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 1)
df <- data.frame(term = names(term.freq), freq = term.freq)
print(findFreqTerms(tdm, lowfreq=2))
This only worked when I ran the script with a source command from the console, i.e. clicking the Run or Source button in RStudio didn't work. I also made sure I chose "Save with Encoding" and "UTF-8" (although this is probably only necessary because I have Turkish text).
> source("Turkish.R")
[1] "değiştirdik"
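A possible variation: source() has an encoding argument, so the file's encoding can be stated explicitly instead of relying on the session default. I have not checked whether this changes the behaviour of the RStudio buttons. In the rough sketch below the temporary file only stands in for Turkish.R, and a UTF-8 locale is assumed:

```r
# Write a small stand-in script as UTF-8 bytes (hypothetical replacement
# for Turkish.R; the \u escapes keep this snippet itself ASCII-only).
script <- tempfile(fileext = ".R")
writeLines('sampleTurkish <- "de\u011fi\u015ftirdik"', script, useBytes = TRUE)

# Source it into a separate environment, declaring the file encoding.
env <- new.env()
source(script, local = env, encoding = "UTF-8")

# Check that the Turkish characters survived as single characters.
sampleTurkish <- get("sampleTurkish", envir = env)
n_chars <- nchar(sampleTurkish, type = "chars")
```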
It was the second answer to "R tm package: utf-8 text" that was useful in the end.
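On the emoji follow-up from the question, which the code above does not cover: one option is a Unicode-range regex in base R. The sample tweet and the code point ranges below are my own assumptions; they cover common emoji blocks, not every emoji.

```r
# Hypothetical tweet text; \u and \U escapes produce UTF-8 strings.
tweet <- "harika bir g\u00fcn \U0001F600 \U0001F389"

# Assumed ranges: the main emoji blocks plus misc symbols and dingbats.
emoji_pattern <- "[\U0001F000-\U0001FAFF\u2600-\u27BF]"

# To USE the emojis: pull them out as their own tokens.
emojis <- regmatches(tweet, gregexpr(emoji_pattern, tweet, perl = TRUE))[[1]]

# To REMOVE them: strip them, then tidy the leftover whitespace.
no_emoji <- gsub(emoji_pattern, "", tweet, perl = TRUE)
no_emoji <- trimws(gsub("\\s+", " ", no_emoji))
```

Running the removal step on tweetsg.df$text before building the corpus keeps the emojis out of the term matrix, while the extracted tokens could be counted separately.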