I am trying to write code that takes the HTML source of a web page and decides what kind of page it is. I am interested in deciding whether the page is about academic courses or not. A naive first approach I have is to check whether the text contains related words (course, instructor, teach, ...) and to decide that the page is about an academic course if it gets enough hits.
However, I need some ideas on how to do this more effectively.
Any ideas would be appreciated.
Thanks in advance :)
Sorry for my English.
There are many approaches to classifying text, but first the web page should be converted to plain text. You can do that either the dumb way, by removing all the HTML tags and reading what's left, or a smarter way, by identifying the main parts of the page that contain the useful text. In the latter case you can use HTML5 elements such as <article>; read about the HTML5 structural elements here.
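As a rough illustration of both options, here is a minimal sketch that uses only Python's standard `html.parser` module. The names `TextExtractor` and `html_to_text` and the `only_article` flag are made up for this example, and it assumes the page actually wraps its main content in <article>, which many pages do not.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self, only_article=False):
        super().__init__()
        self.only_article = only_article  # hypothetical switch for the "smarter" mode
        self.skip_depth = 0      # > 0 while inside <script> or <style>
        self.article_depth = 0   # > 0 while inside <article>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "article":
            self.article_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "article" and self.article_depth > 0:
            self.article_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return  # ignore JavaScript and CSS source
        if self.only_article and not self.article_depth:
            return  # "smarter" mode: keep only text inside <article>
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html, only_article=False):
    """Return the page text as one space-separated string."""
    parser = TextExtractor(only_article)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented sample page, for illustration only.
sample = (
    "<html><head><style>p{}</style></head><body><nav>Home</nav>"
    "<article><h1>CS101</h1><p>Taught by an instructor.</p></article>"
    "<script>var x=1;</script></body></html>"
)
print(html_to_text(sample))                     # dumb way: all visible text
print(html_to_text(sample, only_article=True))  # only the <article> text
```

For messy real-world HTML a dedicated parsing library would probably be more robust, but the idea stays the same.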
Then you can try any of the following methods, depending on how far you are willing to go with your implementation:
All of the above depends on you defining a fixed list of keywords to base your decision on. But life doesn't usually work that way, which is why we use machine learning. The basic idea is that you take a set of documents, manually tag/categorize/classify them, and feed them to your program as a training set. Once your program has learned from them, it can apply what it learned to tag other, untagged documents. If you decide to go with this option, you can check this SO question and this Quora question; the possibilities are endless.
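To make the train-then-tag idea concrete, here is a minimal sketch of one common choice, a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. It is only one of many possible learners, not necessarily what the linked questions recommend. The class name, the labels "course" and "other", and the four training sentences are all invented for illustration; a real training set would need many manually tagged pages.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, labeled_docs):
        """Learn word statistics from (text, label) pairs."""
        for text, label in labeled_docs:
            words = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, text):
        """Return the label with the highest log-probability for the text."""
        total_docs = sum(self.doc_counts.values())
        words = [w for w in tokenize(text) if w in self.vocab]  # drop unseen words
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / total_docs)  # class prior
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # Add-one smoothing so a word unseen in this class is not fatal.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, hand-tagged training set (invented for this example).
training = [
    ("Course syllabus: the instructor will teach lectures and assign homework", "course"),
    ("Enroll in this course; weekly lectures, exams and instructor office hours", "course"),
    ("Buy cheap shoes online with free shipping today", "other"),
    ("Breaking news: local team wins the football match", "other"),
]

classifier = NaiveBayes()
classifier.train(training)
print(classifier.predict("The instructor posts homework after each lecture"))
print(classifier.predict("Free shipping on cheap football shoes"))
```

In practice you would feed it the plain text extracted from each page rather than short sentences, and keep some tagged pages aside to check how often it is right.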
And assuming you speak Arabic, I can share a paper from a project I worked on here if you're interested, but it is in Arabic and deals with the challenges of classifying Arabic text.