TweetNLP provides a tokenizer and a part-of-speech tagger for tweets, which is really cool. Now I wonder if I can take it a step further and expand acronyms. For example, when I get the tweet "ikr", I would like to look it up and get "I know, right?". I guess I could write my own dictionary, but it seems there should already be one?
What I ended up doing was using StanfordNLP with the GATE Twitter model (gate-EN-twitter.model).
Sample tweet:
ikr smh he asked fir yo last name so he can add u on fb lololol
Results without gate-EN-twitter.model:
word: ikr :: pos: NN :: ne:O
word: smh :: pos: NN :: ne:O
word: he :: pos: PRP :: ne:O
word: asked :: pos: VBD :: ne:O
word: fir :: pos: NNP :: ne:O
word: yo :: pos: NNP :: ne:O
word: last :: pos: JJ :: ne:O
word: name :: pos: NN :: ne:O
word: so :: pos: IN :: ne:O
word: he :: pos: PRP :: ne:O
word: can :: pos: MD :: ne:O
word: add :: pos: VB :: ne:O
word: u :: pos: NN :: ne:O
word: on :: pos: IN :: ne:O
word: fb :: pos: NN :: ne:O
word: lololol :: pos: NN :: ne:O
Results with gate-EN-twitter.model:
word: ikr :: pos: UH :: ne:O
word: smh :: pos: UH :: ne:O
word: he :: pos: PRP :: ne:O
word: asked :: pos: VBD :: ne:O
word: fir :: pos: IN :: ne:O
word: yo :: pos: PRP$ :: ne:O
word: last :: pos: JJ :: ne:O
word: name :: pos: NN :: ne:O
word: so :: pos: IN :: ne:O
word: he :: pos: PRP :: ne:O
word: can :: pos: MD :: ne:O
word: add :: pos: VB :: ne:O
word: u :: pos: PRP :: ne:O
word: on :: pos: IN :: ne:O
word: fb :: pos: NNP :: ne:O
word: lololol :: pos: UH :: ne:O
Now I can identify slang by looking for tokens tagged UH (interjection) and looking them up in my custom dictionary.
I'm still puzzled that such a dictionary isn't already available out there, but this solves my issue for the moment.
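For illustration, the UH lookup step could be sketched roughly like this in Python. This is only a sketch, not the code I actually ran: it assumes the tagger output has been saved as lines in the "word: ... :: pos: ... :: ne:..." format shown above, and the SLANG dictionary, its entries, and the function names are all made up for the example.

```python
import re

# Hypothetical custom dictionary; the entries are illustrative only.
SLANG = {
    "ikr": "I know, right?",
    "smh": "shaking my head",
}

# Matches lines in the "word: X :: pos: Y :: ne:Z" format shown above.
LINE_RE = re.compile(r"word:\s*(\S+)\s*::\s*pos:\s*(\S+)\s*::\s*ne:\s*(\S+)")


def parse_tagged(lines):
    """Yield (word, pos) pairs from tagger output lines, skipping anything else."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            yield match.group(1), match.group(2)


def expand_slang(lines, slang=SLANG):
    """Return (word, expansion) for every UH-tagged token.

    The expansion is None when the word is not in the dictionary.
    """
    return [
        (word, slang.get(word.lower()))
        for word, pos in parse_tagged(lines)
        if pos == "UH"
    ]


tagged = [
    "word: ikr :: pos: UH :: ne:O",
    "word: smh :: pos: UH :: ne:O",
    "word: he :: pos: PRP :: ne:O",
    "word: lololol :: pos: UH :: ne:O",
]

for word, expansion in expand_slang(tagged):
    print(word, "->", expansion if expansion is not None else "(not in dictionary)")
```

Note that in the output above "u" and "fb" are tagged PRP and NNP rather than UH, so a UH-only filter will not catch every abbreviation; those would need a separate lookup.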