
Use POS tagging in bag of words


I'm using bag of words for text classification. The results aren't good enough: test set accuracy is below 70%.

One of the things I'm considering is using POS tagging to distinguish the function of words. What is the go-to approach for doing it?

I'm thinking of appending the tags to the words. For example, if the word "love" is used as a noun, use:

love_noun

and if it's a verb use:

love_verb

Solution

  • Test set accuracy near 70% is not that bad if you have hundreds of categories. You might want to measure overall precision and recall instead of accuracy.
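As a rough illustration of measuring precision and recall across many categories, here is a minimal sketch in plain Python. It assumes macro-averaging (each category weighted equally), which is only one reading of "overall"; the function name and the toy labels are made up for the example.

```python
from collections import Counter

def macro_precision_recall(y_true, y_pred):
    """Return (macro precision, macro recall) over all labels seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this label, but it was wrong
            fn[truth] += 1  # the true label was missed
    labels = set(y_true) | set(y_pred)
    # A label that was never predicted (or never true) contributes 0.0.
    precisions = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in labels]
    recalls = [tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0 for c in labels]
    return sum(precisions) / len(labels), sum(recalls) / len(labels)

# Toy labels, invented for illustration only.
y_true = ["finance", "finance", "shopping", "sports"]
y_pred = ["finance", "shopping", "shopping", "sports"]
precision, recall = macro_precision_recall(y_true, y_pred)
```

With hundreds of categories, per-category numbers like these tend to show which classes drag the score down, something a single accuracy figure hides.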

    What you proposed sounds good. It amounts to adding feature conjunctions (a word combined with its POS tag) as additional features. Here are a few suggestions:

    Still keep your original features. That is to say, don't replace love with love_noun or love_verb. Instead, you have two features coming from love:

     love, love_noun (or)
     love, love_verb
    

    If you need some sample code, you can start with the NLTK Python package:

    >>> from nltk import pos_tag, word_tokenize
    >>> pos_tag(word_tokenize("Love is a lovely thing"))
    [('Love', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('lovely', 'JJ'), ('thing', 'NN')]
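Building on tagger output like the above, the "keep the original feature and add the tagged one" idea could be sketched as follows. This is only an illustration: the coarse tag mapping, the lowercasing, and the function name are assumptions, and the input is a hard-coded list of (word, tag) pairs so the snippet runs without NLTK.

```python
from collections import Counter

# Hypothetical mapping from Penn Treebank tag prefixes to coarse labels.
COARSE_TAGS = {"NN": "noun", "VB": "verb", "JJ": "adj", "RB": "adv"}

def word_and_pos_features(tagged_tokens):
    """Count each lowercased word plus a word_tag conjunction feature."""
    features = Counter()
    for word, tag in tagged_tokens:
        token = word.lower()
        features[token] += 1  # original bag-of-words feature is kept
        coarse = COARSE_TAGS.get(tag[:2])
        if coarse is not None:
            features[f"{token}_{coarse}"] += 1  # extra conjunction feature
    return features

# Same (word, tag) pairs as the pos_tag output shown above.
tagged = [("Love", "NNP"), ("is", "VBZ"), ("a", "DT"), ("lovely", "JJ"), ("thing", "NN")]
features = word_and_pos_features(tagged)
```

Here "love" yields both `love` and `love_noun`, so the classifier can still fall back on the plain word when the tagged variant is too sparse.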
    

    Consider using n-grams, perhaps starting by adding 2-grams. For example, you might have "in" and "stock", and you might simply remove "in" because it is a stop word. If you consider 2-grams, you will get a new feature:

    in-stock
    

    which has a different meaning from "stock" alone. It might help a lot in certain cases, for example to distinguish "finance" from "shopping".
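A minimal sketch of the 2-gram idea, assuming stop words are dropped only from the single-word features while 2-grams are built from the unfiltered token sequence. The stop-word list, the hyphen joiner, and the function name are illustrative choices, not part of any library.

```python
from collections import Counter

# Tiny illustrative list, not a real stop-word set.
STOP_WORDS = {"in", "the", "is", "a"}

def unigram_and_bigram_features(tokens):
    """Count non-stop-word unigrams plus all adjacent-word 2-grams."""
    tokens = [t.lower() for t in tokens]
    features = Counter(t for t in tokens if t not in STOP_WORDS)
    # Build 2-grams before any stop-word filtering so "in-stock" survives.
    features.update(f"{a}-{b}" for a, b in zip(tokens, tokens[1:]))
    return features

features = unigram_and_bigram_features(["The", "item", "is", "in", "stock"])
```

The order matters: if stop words were removed first, "in" would be gone before the pair "in-stock" could be formed.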