Our site has user-generated content, and a user can use hashtags to categorize their content. To make searching for content easier, we are thinking about creating "Interest" categories like:
Sex, Hobbies, Current Events, etc.
One way to achieve this would be to associate keywords with each interest category. So, if a user clicks on Hobbies, the system will search for the keywords we've associated with Hobbies like:
Hobbies -> cars, cooking, reading, etc.
However, this method seems limited, since a user can post a picture of a hotrod with the word "sexy" in the body, and in our system the word "sexy" is associated with two interest categories: "Sex" and "Fashion & Beauty".
Any suggestions on how to make this method smarter? Or, suggestions/advice on how companies would implement something like this?
Probably you should weight the categories. Find all the matching words, and assign a value to all categories as follows:
It is a biased weighting (towards unique words); this way you can better decide where the pictures belong.
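To make the idea concrete, here is a minimal sketch of one possible weighting that is biased towards unique words. The 1/n rule (a word shared by n categories contributes 1/n to each of them) is my own assumption for illustration, and the keyword lists and the name `score_categories` are hypothetical.

```python
from collections import defaultdict

# Hypothetical keyword lists; real ones would come from your own curation.
CATEGORY_KEYWORDS = {
    "Hobbies": {"cars", "hotrod", "cooking", "reading"},
    "Sex": {"sexy"},
    "Fashion & Beauty": {"sexy", "makeup"},
}


def score_categories(words, category_keywords=CATEGORY_KEYWORDS):
    """Score each category; words unique to one category count the most."""
    # Count how many categories each keyword belongs to.
    category_count = defaultdict(int)
    for keywords in category_keywords.values():
        for keyword in keywords:
            category_count[keyword] += 1

    unique_words = set(words)
    scores = {}
    for category, keywords in category_keywords.items():
        # Assumed rule: a word shared by n categories contributes 1/n.
        scores[category] = sum(
            1.0 / category_count[word]
            for word in unique_words
            if word in keywords
        )
    return scores


# The hotrod picture with "sexy" in the body, from the question.
scores = score_categories(["hotrod", "sexy"])
best = max(scores, key=scores.get)
print(best, scores)
```

With these lists, "hotrod" belongs only to Hobbies and counts fully, while "sexy" is split between two categories, so Hobbies comes out on top.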
Also, you can build a continuously changing weight matrix that records how relevant each word is to a given category. Frequent words carry less importance (because everybody is using them).
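One way to sketch such a weight matrix is a TF-IDF-style score computed from posts that already have a category. This is a swapped-in, well-known technique, not necessarily the exact scheme meant above; the function name `build_weight_matrix` and the sample posts are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_weight_matrix(labeled_posts):
    """labeled_posts: list of (category, list_of_words) pairs.

    Returns weights[word][category]; rebuild it periodically so the
    matrix keeps changing as new posts arrive.
    """
    labeled_posts = list(labeled_posts)
    doc_freq = Counter()               # number of posts containing each word
    cat_counts = defaultdict(Counter)  # posts per category containing each word
    posts_per_cat = Counter()
    for category, words in labeled_posts:
        unique = set(words)
        doc_freq.update(unique)
        cat_counts[category].update(unique)
        posts_per_cat[category] += 1

    n_posts = len(labeled_posts)
    weights = defaultdict(dict)
    for category, counts in cat_counts.items():
        for word, count in counts.items():
            # Words that appear in most posts get an IDF close to zero.
            idf = math.log(n_posts / doc_freq[word])
            weights[word][category] = (count / posts_per_cat[category]) * idf
    return weights


# Hypothetical sample data.
posts = [
    ("Hobbies", ["hotrod", "sexy"]),
    ("Hobbies", ["hotrod", "cars"]),
    ("Sex", ["sexy"]),
    ("Fashion & Beauty", ["sexy", "makeup"]),
]
weights = build_weight_matrix(posts)
print(weights["hotrod"]["Hobbies"], weights["sexy"]["Hobbies"])
```

Here "sexy" shows up in three of the four posts, so it ends up with a much lower weight for Hobbies than "hotrod" does.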
Also, based on the texts you have already categorized, you can automatically extend the word list and categorize the new words. For example, if a new game name appears in the word list (call it 'abc'), you will notice that 'abc' appears in a lot of texts in the Hobbies category, and nowhere else. So, you can tie this word to this category.
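A rough sketch of that auto-extension step could look like the following. The thresholds `min_posts` and `min_share`, the function name `suggest_new_keywords`, and the sample data are all assumptions for illustration; you would tune them on your own content.

```python
from collections import Counter, defaultdict


def suggest_new_keywords(labeled_posts, known_words, min_posts=3, min_share=0.9):
    """Suggest unknown words that occur almost only in one category.

    labeled_posts: list of (category, list_of_words) pairs.
    Returns a dict mapping each suggested word to its category.
    """
    per_word = defaultdict(Counter)  # word -> posts per category
    for category, words in labeled_posts:
        for word in set(words):
            if word not in known_words:
                per_word[word][category] += 1

    suggestions = {}
    for word, counts in per_word.items():
        total = sum(counts.values())
        category, top = counts.most_common(1)[0]
        # Require enough evidence and a dominant category.
        if total >= min_posts and top / total >= min_share:
            suggestions[word] = category
    return suggestions


# Hypothetical sample: 'abc' is a new game name, 'fun' is used everywhere.
posts = [
    ("Hobbies", ["abc", "fun"]),
    ("Hobbies", ["abc"]),
    ("Hobbies", ["abc", "fun"]),
    ("Sex", ["fun"]),
    ("Current Events", ["fun"]),
]
suggestions = suggest_new_keywords(posts, known_words=set())
print(suggestions)
```

In this sample 'abc' appears only in Hobbies posts and gets tied to that category, while 'fun' is spread across categories and is left alone.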
It's a very exciting area to build auto-learning systems!