Search code examples
algorithmmachine-learningnlpperceptronpos-tagger

Creating a feature function for POS tagging


I am trying to use a Perceptron to perform supervised classification and thereby perform POS tagging of a sentence. I am assuming for now that the tags of each word is independent of the other. (i.e I am just using just the word as a feature). I am fairly new to Machine Learning algorithms, and so I am unable to figure out how to represent the feature function for each word.

I have a training set of 100 sentences, where each word is given a particular tag (say N, V, J(adjective) and so on). For instance,

Jack(N) and(&) Jill(N) went(V) to(PRP) Peru(N)

where the tags are in braces. Lets say I have a total of 10 possible tags. Now my question is how does the feature vector for the word Jack look like?

I am very much interested in implementing it as a vector, since my code will match the notation better. Once I figure out how the feature function looks like, I will be able to implement the Perceptron algorithm!

Also, say I want to add features like (a) Is first letter capitalized? (b) Is word hyphenated etc., How do I incorporate that into my feature vector?

Intuitively I can see that the vector needs to have only binary values, but I am unable to proceed beyond that.

Kindly try to explain with concrete examples if possible!


Solution

  • Use a dictionary which maps words to numeric ids. If your vocabulary has 10,000 items in it, your dictionary maps each word to a number in the range 0-9999 and every word is represented as a binary vector of length 10,000 where only one element is active: that corresponding to the word's id in the dictionary.

    If you want extra features besides word ids, you can just tack these on to the end of the feature vector: that is, you can make features 10,000+ be the capitalisation feature, the previous tag feature (will need binary coding as above) etc.

    As a final point, POS tagging is an instance of a structured prediction problem, rather than a series of independent classifications. If this becomes more than an academic exercise, you'll want to move to the structured perceptron, or another structured learning method like a CRF or struct-SVM.

    EDIT: a simple example

    Imagine I have a five word vocabulary, {the,cat,sat,on,mat}, and a reduced tagset {DET,N,V,PREP}. My sentence is thus:

    (The,DET) (cat,N) (sat,V) (on,PREP) (the,DET) (mat,N).

    Now I want a feature vector for each word, from which I would like to be able to predict the tag. I am going to use features 0-4 as my word id indicator functions, so that feature 0 corresponds to 'the', feature 1 to 'cat' and so on. This gives me the following feature vectors (with the intended 'class' or tag assignment following the ->):

    [1 0 0 0 0] -> DET
    [0 1 0 0 0] -> N
    [0 0 0 0 0] -> V
    ...
    

    I could treat these as instances and apply my learning algorithm of choice to this task, however, word ID functions alone won't buy me much. I decide I want to incorporate some HMM-like intuition into my classifications, so I also add feature functions which indicate what the previous tag was. So I use features 5-8 as indicators for this, with 5 corresponding to DET, 6 to N, and so on. Now I have the following:

    [1 0 0 0 0 0 0 0 0] -> DET (because this is the first word there's no previous tag)
    [0 1 0 0 0 1 0 0 0] -> N
    [0 0 0 0 0 0 1 0 0] -> V
    

    Now I can keep adding features to my heart's content, using for example feature 9 to indicate whether the word is capitalised or not, feature 10 might be whether the word matches a list of known proper nouns, etc etc. If you read a little about structured prediction tasks and methods, you should see why using a model customised for this task (easiest is an HMM, but I'd want to progress to a CRF/Structured Perceptron/StructSVM for state of the art performance) is superior to treating this as a bunch of independent decisions.