regex, nlp, jsoup, semantic-web

How to decide if a webpage is about a specific topic or not?


I am trying to write code that takes the source HTML of a web page and decides what kind of page it is. Specifically, I am interested in deciding whether the page is about academic courses or not. My naive first approach is to check whether the text contains related words (course, instructor, teach, ...) and to decide that the page is about an academic course if it gets enough hits.

However, I would like some ideas on how to do this more effectively.

Any ideas would be appreciated.

Thanks in advance :)

Sorry for my English.


Solution

  • There are many approaches to classifying a text, but first the web page has to be converted to plain text. You can do that the dumb way, by removing all the HTML tags and reading what's left, or in a smarter way, by identifying the main parts of the page that contain the useful text. In the latter case you can rely on HTML5 elements such as <article>; the HTML5 structural elements are worth reading about.
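As a rough illustration of both options, here is a minimal sketch using only Python's standard `html.parser` module. The names `TextExtractor` and `html_to_text` are made up for this example. It assumes that text inside `<article>` is the main content when such an element exists, and otherwise falls back to all visible text; real pages will often need something more robust.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and, separately, the text found inside <article>."""

    SKIPPED = {"script", "style"}  # content of these tags is never visible text

    def __init__(self):
        super().__init__()
        self.all_text = []
        self.article_text = []
        self._skip_depth = 0
        self._article_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "article":
            self._article_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "article" and self._article_depth > 0:
            self._article_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = data.strip()
        if not text:
            return
        self.all_text.append(text)
        if self._article_depth:
            self.article_text.append(text)


def html_to_text(html):
    """Return the <article> text if there is any, otherwise all visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    chunks = parser.article_text or parser.all_text
    return " ".join(chunks)


page = (
    "<html><head><style>p{}</style></head><body><nav>Home</nav>"
    "<article><h1>Intro to NLP</h1><p>Instructor: Dr. Smith</p></article>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(page))
```

The navigation text is dropped here only because an `<article>` element is present; on a page without one, everything visible is kept, which corresponds to the "dumb" option.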

    Then you can try any of the following methods, depending on how far you are willing to go with your implementation:

    • As you mentioned, a simple search for related words, but that would give you a very low success rate.
    • Improve on the approach above by passing the tokens of the text through a lexical analyzer (a part-of-speech tagger) and focusing on the nouns, which usually carry the most value. I will try to find the source for this, but I'm sure I read it somewhere while implementing a similar project. This might improve the rate a little.
    • Improve further by looking at the root of each word. You can use a morphological analyzer to do so, and that way you can tell that the word "papers" is the same as "paper". That can improve things a little more.
    • You can also use an ontology of words such as WordNet, and then check whether the words in the document are descendants of one of the words you're looking for. You can also go the other way around, but going up means generalizing, which would hurt precision. For example, you can tell that the word "kitten" is related to the word "cat", and so you can assume that since the document talks about "kittens" it also talks about "cats".
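To make the first and third bullets concrete, here is a small sketch of keyword counting combined with very crude suffix stripping. The suffix stripping is only a stand-in for a real morphological analyzer, and the keyword list, function names, and threshold are all assumptions chosen for illustration. Part-of-speech tagging and WordNet lookups are not covered, since they need an external NLP library.

```python
import re

# Hypothetical keyword list; adjust it to the topic you care about.
COURSE_KEYWORDS = {"course", "instructor", "teach", "lecture", "syllabus", "exam"}


def candidate_roots(word):
    """Return the word plus the forms left after stripping a few English suffixes.

    This is a crude approximation of morphological analysis: "papers" yields
    "paper", "teaches" yields "teach", "teaching" yields "teach".
    """
    roots = {word}
    for suffix in ("s", "es", "ing"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            roots.add(word[: -len(suffix)])
    return roots


def keyword_hits(text, keywords=COURSE_KEYWORDS):
    """Count the tokens in the text whose candidate roots include a keyword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if candidate_roots(token) & keywords)


def looks_like_course_page(text, min_hits=3):
    """Decide by a fixed threshold; min_hits=3 is an arbitrary example value."""
    return keyword_hits(text) >= min_hits


sample = "The instructor teaches two courses; lectures are recorded."
print(keyword_hits(sample), looks_like_course_page(sample))
```

Returning all candidate forms instead of a single stem avoids having to guess which suffix is the right one to strip, at the cost of an occasional false match.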

    All of the above depends on a predefined list of keywords on which you base your decision. Real pages rarely cooperate with a fixed list, which is why machine learning is commonly used instead. The basic idea is to collect a set of documents, manually tag/categorize/classify them, and feed them to your program as a training set. Once it has learned from them, the program can apply what it learned to tag other, untagged documents. If you decide to go with this option, you can check this SO question and this Quora question; the possibilities are endless.
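The answer does not name a particular learning algorithm. As one possible illustration of the train-then-tag idea, here is a minimal multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. The training sentences and the labels "course" and "other" are invented for this sketch; a real training set would need many more manually tagged pages.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        # Words never seen in training carry no information, so skip them.
        words = [w for w in tokenize(text) if w in self.vocabulary]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented, manually tagged training set (far too small for real use).
training_set = [
    ("Course syllabus: the instructor teaches weekly lectures and grades the final exam", "course"),
    ("Students enroll in this course; homework and lecture notes are posted by the instructor", "course"),
    ("Buy cheap shoes online with free shipping and easy returns", "other"),
    ("Latest football scores, match highlights and transfer news", "other"),
]

classifier = NaiveBayes()
classifier.train(training_set)
print(classifier.predict("Lecture schedule and exam dates for the course"))
print(classifier.predict("Free shipping on football shoes"))
```

Unlike the keyword approach, nothing here is specific to courses: the vocabulary and its weights come entirely from the tagged documents, so the same code works for another topic once you supply a different training set.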

    If you speak Arabic and are interested, I can share a paper from the project I worked on here; it is written in Arabic and deals with the challenges of classifying Arabic text.