machine-learning document-classification

Classifying website type from webpages


Are there any reliable, deployed approaches, algorithms, or tools for tagging a website's type by parsing some of its webpages?

For example: forums, blogs, press-release sites, news, e-commerce, etc.

I am looking for well-defined characteristics (static rules) from which the type can be determined. If none exist, I hope a machine learning model may help.

Suggestions/ideas?


Solution

  • If you approach this from a machine learning standpoint, a Naive Bayes classifier probably gives the best payoff for the work involved. A version of it is used in Winnow to categorize news articles.

    You will need a collection of pages, each tagged with its proper category. Then you extract words or other relevant elements from each page and use them as features.

    Dr. Dobb's has an article on implementing Naive Bayes.
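
To make the steps above concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. It is not the Winnow or Dr. Dobb's implementation. The class name `NaiveBayes`, the `tokenize` helper, the category labels, and the tiny `TRAINING_PAGES` set are all made up for illustration. In practice you would train on the text extracted from a real, much larger set of tagged pages, and probably add features beyond plain words.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of pages
        self.vocab = set()                       # every word seen in training

    def train(self, text, category):
        """Record the words of one tagged page."""
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        """Return the category with the highest log-probability."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            # Log prior: share of training pages in this category.
            score = math.log(n_docs / total_docs)
            # Smoothed log likelihood of each word given the category.
            denom = sum(self.word_counts[category].values()) + len(self.vocab)
            for word in words:
                score += math.log((self.word_counts[category][word] + 1) / denom)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical toy data standing in for text extracted from tagged pages.
TRAINING_PAGES = [
    ("reply quote thread posted by member joined posts moderator", "forum"),
    ("new topic reply thread subscribe members online posts", "forum"),
    ("add to cart price checkout shipping customer reviews", "ecommerce"),
    ("buy now price discount cart free shipping in stock", "ecommerce"),
    ("breaking news reporter said officials government election", "news"),
    ("news correspondent reports minister said on tuesday", "news"),
]

clf = NaiveBayes()
for page_text, page_category in TRAINING_PAGES:
    clf.train(page_text, page_category)

print(clf.classify("add this item to your cart and proceed to checkout"))
```

Working in log space avoids numeric underflow when a page has many words, and the add-one smoothing keeps a single unseen word from zeroing out a category.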