I'm using the bag of words for text classification. Results aren't good enough, test set accuracy is below 70%.
One of the things I'm considering is to use POS tagging to distinguish the function of words. How is the to go approach to doing it?
I'm thinking on append the tags to the words, for example the word "love", if it's used as a noun use:
love_noun
and if it's a verb use:
love_verb
Test set accuracy near 70% is not that bad if you have hundreds of categories. You might want to measure overall precision and recall instead of accuracy.
What you proposed sounds good, which is an approach to add feature conjunctions as additional features. Here are a few suggestions:
Still keep your original features. That is to say, don't replace love
with love_noun
or love_verb
. Instead, you have two features coming from love
:
love, love_noun (or)
love, love_verb
If you need some sample code, you can start from nltk
python package.
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("Love is a lovely thing"))
[('Love', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('lovely', 'JJ'), ('thing', 'NN')]
Consider using n-grams, maybe starting from adding 2-grams. For example, you might have "in" and "stock" and you might just remove "in" because it is a stop-word. If you consider 2-grams, you will get a new feature:
in-stock
which has a different meaning to "stock". It might help a lot in certain cases, for example, to distinguish from "finance" from "shopping".