I am planning on using Python NLTK for academic research. In particular, I need a way of screening Twitter users and teasing out the ones who do not seem to be using a "real name" in their profile.
I am thinking about using NLTK's default named-entity recognition to separate the Twitter users who use a seemingly real name from those who don't. Do you think it's worth a try? Or should I train the classifier myself?
import time

import nltk

##contentArray = ['Health Alerts', 'Kenna Hill']
contentArray = ['ICU nurse toronto']

##let the fun begin!##
def processLanguage():
    try:
        for item in contentArray:
            tokenized = nltk.word_tokenize(item)
            tagged = nltk.pos_tag(tokenized)
            print(tagged)
            namedEnt = nltk.ne_chunk(tagged)
            ##namedEnt.draw()
            time.sleep(1)
    except Exception as e:
        print(str(e))

processLanguage()
Edit: I have done a bit of testing. It seems NLTK recognizes a named entity primarily by whether or not the first letter of the word is capitalized. For example, "ICU Nurse Toronto" is tagged NNP while "ICU nurse toronto" is not. That seems overly simplistic and not very useful for my purpose (Twitter), since many Twitter users with real names may write them in lower case, while some commercial organizations capitalize the first letter.
Definitely train one yourself. NLTK's NE recognizer is trained to recognize named entities embedded in full sentences. But don't just retrain the NLTK NE recognizer on new data: it is a "sequential classifier", meaning it takes into account the surrounding words, the POS tags, and the named-entity classification of the preceding words. Since all you have is the usernames themselves, that surrounding context is not available or relevant for your purposes.
I suggest you train a regular classifier (e.g., Naive Bayes), feed it whatever custom features you think might be relevant, and ask it to decide "is this a real name". To train, you must have a training corpus that contains examples of names and examples of non-names. Ideally the corpus should consist of what you're trying to classify: Twitter handles.
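A minimal sketch of that pipeline with NLTK's NaiveBayesClassifier. The two features and the tiny labeled corpus here are made up purely for illustration; in practice you'd extract richer features and label real Twitter profile names:

```python
import nltk

# Hypothetical toy features; real ones would be richer (see below).
def profile_features(name):
    return {
        "has_digit": any(ch.isdigit() for ch in name),
        "titlecase": name.istitle(),
    }

# Made-up training corpus; in practice, hand-labeled Twitter names.
train = [
    ("Kenna Hill", "name"),
    ("John Smith", "name"),
    ("Mary Jones", "name"),
    ("health alerts 24", "not_name"),
    ("xx_gamer_99", "not_name"),
    ("deals4u2day", "not_name"),
]

classifier = nltk.NaiveBayesClassifier.train(
    [(profile_features(n), label) for n, label in train]
)

print(classifier.classify(profile_features("Anna Lee")))  # prints "name"
```

On this toy data the two classes are perfectly separated by the features, so the classifier's decision is trivial; the point is only the shape of the pipeline (feature dict in, label out).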
Re the question in your comment, don't use entire words as features: your classifier can only reason with features it knows about, so census names can't help you with novel names unless your features are about parts of the name. Usually the features represent the endings (last letter, final bigram, final trigram), but you can try other things too like length and of course capitalization. The NLTK chapter discusses the task of recognizing the gender of names, and gives many examples of suffix features.
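For instance, a per-word extractor along the lines of the NLTK gender example might look like this (the feature names are my own, not anything standard):

```python
def word_features(word):
    """Suffix-style features for a single word, plus length and capitalization."""
    lower = word.lower()
    return {
        "last_letter": lower[-1],
        "final_bigram": lower[-2:],
        "final_trigram": lower[-3:],
        "length": len(lower),
        "capitalized": word[:1].isupper(),
    }
```

Because the features describe parts of the word rather than the whole word, a classifier trained on census names can generalize to names it has never seen.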
The catch, in your case, is that you have multiple words. So your classifier needs to be told somehow if some words are recognized as names and some are not. Somehow you must define your features in a way that preserves this information. E.g., you could set the feature "known names" to have the values "None", "One", "Several", "All". (Note that the NLTK's implementation treats feature values as "categories": They are simply distinct values. You can use 3 and 4 as feature values, but as far as the classifier is concerned you might as well have used "green" and "elevator".)
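One way to sketch such a feature (the KNOWN_NAMES set here is a tiny stand-in for a real name list, e.g. census data):

```python
# Stand-in for a real list of known first/last names (e.g. census data).
KNOWN_NAMES = {"kenna", "hill", "mary", "john", "smith"}

def known_names_feature(profile_name):
    """Map a multi-word profile name to a categorical value."""
    tokens = profile_name.lower().split()
    hits = sum(1 for t in tokens if t in KNOWN_NAMES)
    if hits == 0:
        return "None"
    if hits == len(tokens):
        return "All"
    if hits == 1:
        return "One"
    return "Several"
```

The values "None"/"One"/"Several"/"All" are just distinct categories to the classifier, exactly as described above.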
And don't forget to add a "bias" feature with constant value (see the NLTK chapter).