php algorithm nlp data-mining stanford-nlp

Proposed NLP algorithm for text tagging

I was looking for an open-source tool that can identify tags for any user post on social media and classify comments on that post as on-topic, off-topic, or spam. Even after searching for an entire day, I could not find a suitable tool or library.

So here I propose my own algorithm for tagging user posts that belong to 7 categories (jobs, discussion, events, articles, services, buy/sell, talents).

Initially, when a user makes a post, he tags it himself. Tags can be things like marketing, suggestion, entrepreneurship, MNC, etc. So assume that for some posts I already have the tags and know which category they belong to.

Steps:

Perform POS (part-of-speech) tagging on the user post. Two things can be done here:
- Consider only nouns. Nouns may represent the tag for a post more intuitively, I guess.
- Consider both nouns and adjectives. Here we can collect a large number of nouns and adjectives. The frequency of such words can be used to identify the tag for that post.
For each user-defined tag, we collect the POS-filtered words (nouns, or nouns and adjectives) from the posts carrying that tag. Example: consider the user-assigned tag marketing, whose posts contain the words SEO and adwords. Suppose 10 posts tagged marketing contain SEO and adwords 5 and 7 times respectively. The next time a post arrives that has no tag but contains the word SEO, we look up which tag SEO occurs under most often. If that is marketing (5 occurrences in this example), we predict the marketing tag for the post.
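
The counting scheme described above could be sketched roughly as follows. This is only an illustration in Python, assuming the nouns/adjectives have already been extracted from each post; `tagged_posts`, `word_counts`, and `predict_tag` are hypothetical names, and the sample data is made up.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (user-assigned tag, words already extracted
# from the post, e.g. its nouns and adjectives).
tagged_posts = [
    ("marketing", ["seo", "adwords", "campaign"]),
    ("marketing", ["adwords", "budget"]),
    ("marketing", ["seo", "adwords"]),
    ("hiring", ["resume", "interview", "seo"]),
]

# word_counts[tag][word] = how often `word` was seen in posts carrying `tag`
word_counts = defaultdict(Counter)
for tag, words in tagged_posts:
    word_counts[tag].update(words)


def predict_tag(words):
    """Return the tag whose stored counts best match `words`, or None."""
    scores = {
        tag: sum(counts[w] for w in words)
        for tag, counts in word_counts.items()
    }
    best = max(scores, key=scores.get)
    # If no known word occurs at all, refuse to guess.
    return best if scores[best] > 0 else None


print(predict_tag(["seo", "tips"]))  # marketing
```

Summing raw counts favours tags that simply have more posts, so some normalisation would probably be needed in practice.
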
The next step is to identify spam or off-topic comments on a post. Consider a user post in the jobs category that carries the tag marketing. I will look up in the database the 10-15 most frequent POS-filtered words (i.e. nouns and adjectives) for marketing.

In parallel, I POS-tag the comment. I then check whether the nouns and adjectives of the comment include any of the most frequent words (we can consider 15-20 such words) for marketing.

If the nouns and adjectives in the comment do not match any of the most frequent words for marketing, the comment can be considered off-topic/spam.
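
A rough sketch of this off-topic check, again in Python and again assuming the comment's nouns and adjectives are already extracted. `marketing_counts` and `is_off_topic` are hypothetical names and the counts are invented for illustration.

```python
from collections import Counter

# Hypothetical word counts collected for the "marketing" tag.
marketing_counts = Counter(
    {"adwords": 7, "seo": 5, "campaign": 4, "brand": 3, "budget": 1}
)


def is_off_topic(comment_words, tag_counts, top_n=15):
    """True when the comment shares no word with the tag's top_n words."""
    top_words = {word for word, _ in tag_counts.most_common(top_n)}
    return top_words.isdisjoint(comment_words)


print(is_off_topic(["seo", "advice"], marketing_counts))   # False
print(is_off_topic(["cheap", "shoes"], marketing_counts))  # True
```

Note that a single shared word is enough to pass this check, so a spam comment that happens to mention one frequent word would slip through. Requiring a minimum overlap might be one way to tighten it.
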

Do you have any suggestions to make this algorithm more intuitive?

I guess an SVM could help with classification. Any suggestions on this?

Apart from this, which machine learning techniques could help the system learn to predict tags and spam (off-topic) comments?

Solution

The main problem as I see it is with your feature modeling. While picking out only nouns would help reduce the feature space, it is an extra step with a potentially significant error rate. And do you really care whether you are looking at market/N and not market/V?

Most mainline text classification implementations using naive Bayesian classifiers just ignore the POS, and simply count each distinct word form as an independent feature. (You could also do brute-force stemming to reduce market, markets, and marketing to a single stem form and thus a single feature. This tends to work in English, but might not be adequate if you are actually working in a different language.)
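
To make the "each word form is a feature" idea concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, written in plain Python. The class name, the tokenizer, and the training sentences are all made up for illustration; a real system would more likely use an existing library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Every distinct lowercased word form is a feature; no POS tagging.
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def scores(self, text):
        """Log-probability score per label, with add-one smoothing."""
        # Words never seen in training are ignored.
        words = [w for w in tokenize(text) if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        result = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                score += math.log(
                    (counts[w] + 1) / (total_words + len(self.vocab))
                )
            result[label] = score
        return result

    def classify(self, text):
        scores = self.scores(text)
        return max(scores, key=scores.get)


nb = NaiveBayes()
nb.train("SEO and adwords tips for your marketing campaign", "marketing")
nb.train("Improve adwords budget and SEO ranking", "marketing")
nb.train("Hiring a senior developer, send your resume", "jobs")
nb.train("Open position for a sales manager, apply now", "jobs")
print(nb.classify("Any adwords advice?"))  # marketing
```

Stemming, if wanted, would slot into `tokenize` without touching the rest.
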

A compromise could be to do POS filtering when you train your classifier. Then word forms which do not have a noun reading end up with a zero score in the classifier, so you don't have to do anything to filter them out when you use the resulting classifier.
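
A small sketch of that compromise, using raw counts rather than a full classifier to keep it short. The hand-written `NOUNS` set is only a stand-in for a real POS tagger, and `train` and `score` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Stand-in for a real POS tagger: a hand-written noun list (assumption).
NOUNS = {"seo", "adwords", "campaign", "resume", "developer", "market"}


def is_noun(word):
    return word in NOUNS


counts = defaultdict(Counter)  # tag -> word -> count


def train(words, tag):
    # POS filtering happens only here, at training time.
    counts[tag].update(w for w in words if is_noun(w))


def score(words, tag):
    # No filtering at prediction time: words never stored contribute zero.
    return sum(counts[tag][w] for w in words)


train(["great", "seo", "campaign", "run"], "marketing")
train(["send", "resume", "developer"], "jobs")

print(score(["run", "great", "seo"], "marketing"))  # 1
print(score(["run", "great", "seo"], "jobs"))       # 0
```

The tagger is only needed while building the model, so prediction stays fast and does not inherit tagging errors on new text.
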

Empirically, SVM tends to achieve high accuracy, but it comes at the cost of complexity, both in implementation and behavior. A naive Bayesian classifier has the distinct advantage that you can understand precisely how it arrived at a particular conclusion. (Well, most of us mortals cannot claim to have the same grasp of the mathematics behind SVM.) Perhaps a good way to proceed would be to prototype with Bayes, and iron out any kinks while learning how the system as a whole behaves, then maybe later consider switching to SVM once the other parts are stable?

The "spam" category is going to be harder than any well-defined content category. It would be tempting to suggest that anything which doesn't fit any of your content categories is off-topic, but if you are going to use the verdict for automatic spam filtering, this is likely to cause some false positives at least in the early stages. A possible alternative could be to train classifiers for particular spam categories -- one for medications, another for running shoes, etc.