Search code examples
machine-learningclassificationsvmlibsvm

help organizing my data for this machine learning problem


I want to categorize tweets within a given set of categories like {'sports', 'entertainment', 'love'}, etc...

My idea is to take the term frequencies of the most commonly used words to help me solve this problem. For example, the word 'love' shows up most frequently in the love category but it also shows up in sports and entertainment in the form of "I love this game" and "I love this movie".

To solve it, I envisioned a 3-axis graph where the x values are all the words used in my tweets, the y values are the categories, and the z values are the term frequencies (or some type of score) with the respect to the word and the category. I would then break up the tweet onto the graph and then add up the z values within each category. The category with the highest total z value is most likely the correct category. I know this is confusing, so let me give an example:

The word 'watch' shows up a lot in sports and entertainment ("I am watching the game" and "I am watching my favorite show") ...Therefore, I narrowed it down to these two categories at the least. But the word 'game' does not show up often in entertainment and show does not show up often in sports. the Z value for 'watch' + 'game' will be highest for the sports category and 'watch' + 'show' will be highest for entertainment.

Now that you understand how my idea works, I need help organizing this data so that a machine learning algorithm can predict categories when I give it a word or set of words. I've read a lot about SVMs and I think they're the way to go. I tried libsvm, but I can't seem to come up with a good input set. Also, libsvm does not support non-numeric values, which is adding more complexity.

Any ideas? Do I even need a library, or should I just code up the decision-making myself?

Thanks all, I know this was long, sorry.


Solution

  • Well you are trying to do text classification into a group of categories. Naive Bayes does this. In fact, it is a statistical analogue of your idea. It assumes that frequency of words in a text are independent indicators of a category and gives a probability of each category based on this assumption. It works well in practice; I believe Weka has an implementation.