text twitter nlp classification training-data

Twitter Subjectivity Training Sets

I need a reliable and accurate way to classify tweets as subjective or objective. In other words, I need to build a classifier in something like Weka using a training set.

Are there any training sets available that could be used to train a subjective/objective classifier for Twitter messages, or for other domains that may be transferable?

Solution

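SentiWordNet is a good fit here: it is a lexical resource that assigns positivity, negativity and objectivity scores to WordNet synsets. It is free for research and non-profit purposes, and a commercial license is available too.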

SentiWordNet : http://sentiwordnet.isti.cnr.it/

Sample Java Code: http://sentiwordnet.isti.cnr.it/code/SWN3.java

The other approach I would try:

Example

Tweet 1: @xyz u should see the dark knight. Its awesme.

1) First, look up each word in a dictionary.

"u" and "awesme" will not return anything.
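A rough sketch of step 1 in Python, assuming a plain word set stands in for the dictionary. The DICTIONARY contents and the function names are made up for illustration; a real word list would be far larger.

```python
import re

# Hypothetical stand-in for a real English dictionary.
DICTIONARY = {"you", "should", "see", "the", "dark", "knight", "its", "awesome"}

def tokenize(tweet):
    """Return lowercase word tokens, skipping @mentions."""
    # Drop @mentions first so user names are not treated as words.
    without_mentions = re.sub(r"@\w+", " ", tweet)
    return re.findall(r"[a-z']+", without_mentions.lower())

def unknown_words(tweet, dictionary=DICTIONARY):
    """Return the tokens that the dictionary lookup does not recognise."""
    return [word for word in tokenize(tweet) if word not in dictionary]

print(unknown_words("@xyz u should see the dark knight. Its awesme."))
```

Only the words returned here need to go through the later steps.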

2) Then check the unmatched words against known abbreviations and shorthand, and substitute any matches with their expansions (some resources: netlingo http://www.netlingo.com/acronyms.php or smsdictionary http://www.smsdictionary.co.uk/abbreviations).

Now the original tweet will look like:

Tweet 1: @xyz you should see the dark knight. Its awesme.
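The substitution in step 2 could look something like this. It is only a sketch: the SHORTHAND table is a tiny hypothetical sample, and in practice it would be built from a resource such as the ones linked above.

```python
import re

# Hypothetical shorthand table with a few sample entries.
SHORTHAND = {"u": "you", "r": "are", "gr8": "great"}

def expand_shorthand(tweet, table=SHORTHAND):
    """Replace known shorthand words with their expansions."""
    def replace(match):
        word = match.group(0)
        return table.get(word.lower(), word)
    # The lookbehind leaves @mentions and #hashtags untouched.
    return re.sub(r"(?<![@#\w])\w+", replace, tweet)

print(expand_shorthand("@xyz u should see the dark knight. Its awesme."))
```

Words that are not in the table pass through unchanged, so "awesme" is left for the spell checker.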

3) Then feed the remaining unknown words to a spell checker and substitute the best match (not always ideal, and error-prone for short words).

Now the original tweet will look like:

Tweet 1: @xyz you should see the dark knight. Its awesome.
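As a minimal illustration of step 3, Python's standard difflib can stand in for a spell checker. The VOCABULARY list and the 0.8 cutoff are assumptions for this example; a real spell checker would use a full word list and probably word frequencies as well.

```python
import difflib

# Hypothetical vocabulary; a real one would be a full word list.
VOCABULARY = ["you", "should", "see", "the", "dark", "knight", "its", "awesome"]

def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the input if nothing is close enough."""
    if word.lower() in vocabulary:
        return word
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct_word("awesme"))
print(correct_word("u"))
```

The high cutoff is a deliberate choice here: very short words have many near neighbours, so it is usually safer to leave them alone than to guess.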

4) Split the tweet into words, feed them into SWN3, and aggregate the results.
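One possible way to aggregate, sketched under some assumptions: the scores have already been collapsed to one (positive, negative) pair per word, and since SentiWordNet defines objectivity as 1 - (positive + negative), the sum of the pair is used as the word's subjectivity. The LEXICON values and the 0.3 threshold below are invented for the example, not taken from SentiWordNet.

```python
# Hypothetical (positive, negative) scores in [0, 1]; not real SentiWordNet values.
LEXICON = {
    "awesome": (0.75, 0.0),
    "dark": (0.0, 0.25),
    "terrible": (0.0, 0.875),
}

def subjectivity(tokens, lexicon=LEXICON):
    """Mean of positive + negative over the tokens found in the lexicon."""
    scored = [lexicon[token] for token in tokens if token in lexicon]
    if not scored:
        return 0.0  # no evidence either way, treat as objective
    return sum(pos + neg for pos, neg in scored) / len(scored)

def is_subjective(tokens, threshold=0.3, lexicon=LEXICON):
    """Label a tweet subjective when its mean score reaches the threshold."""
    return subjectivity(tokens, lexicon) >= threshold

tokens = ["you", "should", "see", "the", "dark", "knight", "its", "awesome"]
print(subjectivity(tokens), is_subjective(tokens))
```

Averaging only over words found in the lexicon keeps long tweets from being diluted by function words; averaging over all tokens would be an equally defensible choice, and the threshold would need tuning on labelled tweets either way.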

The problems with this approach are:

a) Negations have to be handled outside SWN3.

b) Information in emoticons and exaggerated punctuation will be lost unless it is handled separately.
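Both issues can be tackled in a pre-processing pass before the words are scored. The sketch below is only an assumption about how that might look: the negator list, the two-token negation scope and the emoticon pattern are all simplifications chosen for illustration.

```python
import re

# Hypothetical negator list and deliberately simple patterns.
NEGATORS = {"not", "no", "never", "don't", "isn't", "cannot"}
EMOTICON = re.compile(r"[:;=]-?[)(DPp]")
REPEATED_PUNCT = re.compile(r"[!?]{2,}")

def surface_features(tweet):
    """Collect emoticons and exaggerated punctuation before they are stripped."""
    return {
        "emoticons": EMOTICON.findall(tweet),
        "exaggerated_punctuation": REPEATED_PUNCT.findall(tweet),
    }

def mark_negated(tokens, negators=NEGATORS, scope=2):
    """Return (token, negated) pairs; the `scope` tokens after a negator are marked."""
    marked = []
    remaining = 0
    for token in tokens:
        if token in negators:
            marked.append((token, False))
            remaining = scope
        else:
            marked.append((token, remaining > 0))
            remaining = max(0, remaining - 1)
    return marked

print(surface_features("Its awesome!!! :)"))
print(mark_negated(["its", "not", "awesome", "at", "all"]))
```

A word marked as negated could then have its positive and negative scores swapped before aggregation, and the surface features could be passed to the classifier as extra attributes.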