The original tweets have been saved into a file in the following structure:
tweet language || tweet
The following is my pre-processing stage, which removes URLs, the RT marker, usernames and any non-alphanumeric characters:
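def cleanTweets() {
    File dirtyTweets = new File("result.txt")
    File cleanTweets = new File("cleanTweets.txt")
    Scanner console = null
    PrintWriter printWriter = null
    try {
        console = new Scanner(dirtyTweets)
        printWriter = new PrintWriter(new BufferedWriter(new FileWriter(cleanTweets)))
        // LinkedHashSet drops duplicate tweets but keeps their original order
        LinkedHashSet<String> ln = new LinkedHashSet<String>()
        String urlIdentifier = "((http|ftp|https):\\/\\/)?[\\w\\-_]+(\\.[\\w\\-_]+)+([\\w\\-\\.,@?^=%&:/~\\+#]*[\\w\\-\\@?^=%&/~\\+#])?"
        // Removes URLs, the RT marker, Twitter usernames and any non-alphanumeric character.
        // \\bRT\\b only matches RT as a whole word, so words such as "START" are left alone.
        String[] removeNoise = ["\\bRT\\b", urlIdentifier, "(?:\\s|\\A)[@]+([A-Za-z0-9-_]+)", "[^a-zA-Z0-9 ]"]
        while (console.hasNextLine()) {
            String line = console.nextLine()
            // The separator here must match the one actually used in result.txt
            String[] splitter = line.split("\\|\\|\\|", 2)
            // Only looks at the English tweets
            if (splitter.length == 2 && splitter[0].trim() == "en") {
                // Keep only the tweet text, not the language code or the separator
                String text = splitter[1]
                removeNoise.each { noise ->
                    text = text.replaceAll(noise, "").toLowerCase()
                }
                ln.add(text.trim())
            }
        }
        ln.each { line ->
            printWriter.println(line)
        }
    } catch (IOException e) {
        // Report the failure instead of silently swallowing it
        System.err.println("Failed to clean tweets: " + e.message)
    } finally {
        // Closing flushes the buffered output; without it cleanTweets.txt can end up empty
        console?.close()
        printWriter?.close()
    }
}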
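To make it easier to see what these patterns do to a single tweet, here is a minimal Java sketch of the same cleaning steps applied to one string. The class name `TweetCleanerSketch`, the word-boundary form `\bRT\b` and the final whitespace collapse are assumptions added for illustration; everything else follows the patterns above.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class TweetCleanerSketch {

    // Noise patterns, applied in this order: retweet marker, URLs, usernames, other symbols.
    private static final Pattern[] NOISE = {
        // Assumption: match RT only as a whole word so "START" is not damaged
        Pattern.compile("\\bRT\\b"),
        Pattern.compile("((http|ftp|https):\\/\\/)?[\\w\\-_]+(\\.[\\w\\-_]+)+"
                + "([\\w\\-\\.,@?^=%&:/~\\+#]*[\\w\\-\\@?^=%&/~\\+#])?"),
        Pattern.compile("(?:\\s|\\A)[@]+([A-Za-z0-9-_]+)"),
        Pattern.compile("[^a-zA-Z0-9 ]")
    };

    // Strips the noise from one tweet and lower-cases what is left.
    static String clean(String tweet) {
        String text = tweet;
        for (Pattern noise : NOISE) {
            text = noise.matcher(text).replaceAll("").toLowerCase(Locale.ROOT);
        }
        // Assumption: collapse the gaps left behind by removed tokens
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("RT @bob: Loving the new phone! http://t.co/abc123 #happy"));
        System.out.println(clean("Check example.com now"));
    }
}
```

One thing this makes visible: the URL pattern does not require a scheme, so any dotted token such as `example.com` is removed as well, which may or may not be what you want.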
The cleaned tweets are then saved into a new file (cleanTweets.txt). What would be the next stage for sentiment analysis of these tweets?
Here is some sample code for using the Stanford CoreNLP sentiment annotator:
package edu.stanford.nlp.examples;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.sentiment.*;
import edu.stanford.nlp.util.*;
import java.util.Properties;
public class SentimentExample {
    public static void main(String[] args) {
        Annotation document = new Annotation("...insert tweet text here...");
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,sentiment");
        // You might want to enforce treating the entire tweet as one sentence.
        // If so, uncomment the line below that sets ssplit.eolonly to true.
        // Also make sure you remove newlines; this will prevent the
        // sentence splitter from dividing the tweet into different sentences.
        //props.setProperty("ssplit.eolonly","true");
        props.setProperty("parse.binaryTrees","true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(document);
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            System.out.println("---");
            System.out.println(sentence.get(CoreAnnotations.TextAnnotation.class));
            System.out.println(sentence.get(SentimentCoreAnnotations.SentimentClass.class));
        }
    }
}
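Two small helpers tend to be useful around that loop, sketched below in plain Java so they run without CoreNLP on the classpath. The first collapses line breaks so a tweet really is one line before you rely on ssplit.eolonly. The second turns the printed labels into numbers so a tweet that was split into several sentences can still get a single score. The class and method names are hypothetical, the 0 to 4 scale is my own choice, averaging is only one simple way to combine sentences, and the five label strings are the ones I understand the default sentiment model to return.

```java
import java.util.Arrays;
import java.util.List;

final class TweetSentimentSketch {

    // Replaces every run of line breaks (and surrounding spaces) with one space.
    static String toSingleLine(String tweet) {
        return tweet.replaceAll("\\s*[\\r\\n]+\\s*", " ").trim();
    }

    // Maps a sentiment label to a number; the 0..4 scale is an assumption.
    static int scoreOf(String label) {
        switch (label) {
            case "Very negative": return 0;
            case "Negative":      return 1;
            case "Neutral":       return 2;
            case "Positive":      return 3;
            case "Very positive": return 4;
            default:
                throw new IllegalArgumentException("Unknown sentiment label: " + label);
        }
    }

    // Averages the per-sentence scores to get one score for the whole tweet.
    static double averageScore(List<String> sentenceLabels) {
        if (sentenceLabels.isEmpty()) {
            throw new IllegalArgumentException("No sentence labels to average");
        }
        double sum = 0;
        for (String label : sentenceLabels) {
            sum += scoreOf(label);
        }
        return sum / sentenceLabels.size();
    }

    public static void main(String[] args) {
        System.out.println(toSingleLine("great phone\nbad battery"));
        System.out.println(averageScore(Arrays.asList("Positive", "Negative", "Very positive")));
    }
}
```

In the loop above you would collect each sentence's label into a list and pass it to `averageScore` once the tweet has been annotated.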