Tags: r, tweets, sentiment-analysis

Twitter Sentiment Analysis with R Using the German Word List SentiWS


I want to do a sentiment analysis of German tweets. The code I use works fine with English, but when I load the German word list, all scores just come out as zero. As far as I can tell, this must have to do with the different structures of the word lists, so what I need to know is how to adapt my code to the structure of the German word list (a quick peek at both formats is shown after the code below). Could someone take a look at both of the lists?

English Wordlist
German Wordlist

    # load the wordlists
    pos.words = scan("~/positive-words.txt",what='character', comment.char=';')
    neg.words = scan("~/negative-words.txt",what='character', comment.char=';')

    # bring in the sentiment analysis algorithm
    # we got a vector of sentences. plyr will handle a list or a vector as an "l"
    # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
    score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
    {
      require(plyr)
      require(stringr)
      scores = laply(sentences, function(sentence, pos.words, neg.words)
      {
        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        # and convert to lower case:
        sentence = tolower(sentence)
        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)
        # compare our words to the dictionaries of positive & negative terms
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)
        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)
        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)
        return(score)
      },
      pos.words, neg.words, .progress=.progress)
      scores.df = data.frame(score=scores, text=sentences)
      return(scores.df)
    }

    # and to see if it works, there should be a score...either in German or in English
    sample = c("ich liebe dich. du bist wunderbar","I hate you. Die!");sample
    test.sample = score.sentiment(sample, pos.words, neg.words);test.sample
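
A quick look at the first lines of each file shows where the structures differ (a minimal sketch; the SentiWS file name is assumed to match the one used in the answer below):

    # the English list is one plain word per line, with ';' marking comment lines;
    # each SentiWS line additionally carries a POS tag, a polarity weight and the
    # comma-separated inflected forms of the word
    head(readLines("~/positive-words.txt"))
    head(readLines("SentiWS_v1.8c_Positive.txt", encoding = "UTF-8"))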

Solution

  • The SentiWS files are not plain word lists: each line holds a base form together with its POS tag, a polarity weight, and the inflected forms, so they first have to be flattened into a simple vector of words. This may work for you:

    readAndflattenSentiWS <- function(filename) { 
      # read the raw SentiWS entries (UTF-8, one entry per line)
      words = readLines(filename, encoding="UTF-8")
      # drop the "|POS<TAB>weight<TAB>" part of each entry, leaving the base form
      # and its inflections as one comma-separated string per line
      words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)
      # split on the commas and flatten into a single character vector
      words <- unlist(strsplit(words, ","))
      # lower-case to match the cleaned-up tweets
      words <- tolower(words)
      return(words)
    }
    pos.words <- c(scan("positive-words.txt",what='character', comment.char=';', quiet=T), 
                   readAndflattenSentiWS("SentiWS_v1.8c_Positive.txt"))
    neg.words <- c(scan("negative-words.txt",what='character', comment.char=';', quiet=T), 
                  readAndflattenSentiWS("SentiWS_v1.8c_Negative.txt"))
    
    score.sentiment = function(sentences, pos.words, neg.words, .progress='none') {
      # ... see OP ...
    }
    
    sample <- c("ich liebe dich. du bist wunderbar",
                "Ich hasse dich, geh sterben!", 
                "i love you. you are wonderful.",
                "i hate you, die.")
    (test.sample <- score.sentiment(sample, 
                                    pos.words, 
                                    neg.words))
    #   score                              text
    # 1     2 ich liebe dich. du bist wunderbar
    # 2    -2      ich hasse dich, geh sterben!
    # 3     2    i love you. you are wonderful.
    # 4    -2                  i hate you, die.
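
To see what the flattening regex actually does, here is a minimal sketch run on a single SentiWS-style line; the word, POS tag, and weight are made up purely for illustration:

    # hypothetical entry in SentiWS layout: baseform|POS<TAB>weight<TAB>inflections
    entry <- "Wort|NN\t0.1234\tWorte,Wortes"
    sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", entry)
    # [1] "Wort,Worte,Wortes"
    tolower(unlist(strsplit(sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", entry), ",")))
    # [1] "wort"   "worte"  "wortes"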