Tags: html, http, url, web-crawler, search-engine

Identifying content-defining keys in URL query strings


I'm (kind of) crawling URLs, many of which identify their content via query strings, e.g., YouTube videos. Often only a small subset of the keys in the query string actually identifies the content; sometimes the whole query string is not important at all. For example, in most YouTube URLs the query string key v specifies the video, while hd, if present and set to 1, merely plays the same video in HD.

I'd now like to find out which of the keys in the query string are indeed important for the content. For this, I currently compare the page corresponding to an original URL (e.g., http://www.youtube.com/watch?v=kA0pkemJxMc&hd=1) to the pages I receive when I remove each individual query string key in turn (http://www.youtube.com/watch?v=kA0pkemJxMc and http://www.youtube.com/watch?hd=1). If the pages are the same, I argue that the removed key was not important.
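For illustration, here is a minimal Python sketch (standard library only) of how such variants can be generated; the helper name is mine, not part of the original setup:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def variants_without_one_key(url):
    """Yield (removed_key, variant_url) for every query-string key in `url`."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    keys = list(dict.fromkeys(k for k, _ in pairs))  # preserve original key order
    for key in keys:
        remaining = [(k, v) for k, v in pairs if k != key]
        query = urlencode(remaining)
        yield key, urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

for key, variant in variants_without_one_key(
        "http://www.youtube.com/watch?v=kA0pkemJxMc&hd=1"):
    print(key, "->", variant)
# v  -> http://www.youtube.com/watch?hd=1
# hd -> http://www.youtube.com/watch?v=kA0pkemJxMc
```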

With that, the question is: when are two pages the same? For the time being I test two things: (a) If the titles of the pages differ, I assume the pages differ. This is often good enough, but I have already stumbled over various websites that always use the same generic title. (b) I extract the visible text of both pages and compute the top-k most frequent words. If these two sets differ, I assume the pages differ. This also works reasonably well, but many pages contain dynamic content (e.g., the latest tweets or Facebook messages in a sidebar DIV), which affects the set of most frequent words.
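As a rough sketch of those two tests (assuming the HTML of both pages has already been fetched; BeautifulSoup and the helper names are my choices, not part of the original setup):

```python
import re
from collections import Counter
from bs4 import BeautifulSoup

def title_and_top_words(html, k=20):
    """Return the page title and the set of the k most frequent visible words."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    for tag in soup(["script", "style"]):  # keep only visible text
        tag.decompose()
    words = re.findall(r"[a-z0-9]+", soup.get_text(" ").lower())
    top_k = {w for w, _ in Counter(words).most_common(k)}
    return title, top_k

def pages_look_same(html_a, html_b, k=20):
    """Heuristics (a) and (b): same title and same top-k word set."""
    title_a, top_a = title_and_top_words(html_a, k)
    title_b, top_b = title_and_top_words(html_b, k)
    return title_a == title_b and top_a == top_b
```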

I guess there's no 100% fool-proof way to determine the important, i.e., content-defining (maybe this is even open to interpretation), query string keys. Still, I wonder how I could improve my mechanism.


Solution

  • Christian, you have an interesting problem! :-)

    You could use the canonical tag as an additional hint, provided it is used on the website you are testing AND implemented correctly (which you may want to check manually per site beforehand). First, you would look up its target location on the “original URL”. Then, any page that is returned after dropping some parameters and still points to the same canonical URL probably has the same main content.
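    One way this check could look in Python (a sketch only; it assumes the HTML of both pages is already fetched, and bs4 is just one possible parser):

```python
from bs4 import BeautifulSoup

def canonical_url(html):
    """Return the href of <link rel="canonical">, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("link", rel="canonical")
    return link.get("href") if link else None

def same_canonical(html_original, html_variant):
    # Only meaningful on sites that actually set a canonical URL.
    original = canonical_url(html_original)
    return original is not None and original == canonical_url(html_variant)
```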

    An additional idea: you are already working with term-frequency lists. In my experience, you could use something like the Levenshtein distance on the top X elements of those lists to get a softer score of how similar the two documents are. You could then define (after some experiments) a threshold for similar enough / not similar enough (or even one per page, depending on your exact goal).
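    A sketch of that idea: the classic Levenshtein algorithm applied to the ranked word lists instead of character strings (the 0.3 threshold is a placeholder you would tune experimentally):

```python
def edit_distance(seq_a, seq_b):
    """Classic Levenshtein distance, applied to arbitrary sequences (here: word lists)."""
    prev = list(range(len(seq_b) + 1))
    for i, a in enumerate(seq_a, 1):
        curr = [i]
        for j, b in enumerate(seq_b, 1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similar_enough(top_words_a, top_words_b, threshold=0.3):
    """Normalise the distance by the longer list and compare against a tuned threshold."""
    distance = edit_distance(top_words_a, top_words_b)
    longest = max(len(top_words_a), len(top_words_b), 1)
    return distance / longest <= threshold
```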

    [EDIT]

    Well, it may not be so simple to implement (that depends on your programming skills and experience), but after reflecting for a while I think you will get the best results if you check the document similarity with the help of the term vector model. You could even improve on that by including HTML tags in the vectors (since in your setting you want to test the similarity of the whole documents).
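    A sketch of that last idea: term-frequency vectors compared with cosine similarity, optionally including the HTML tag names in the vectors (the 0.9 threshold and all names here are illustrative, not part of the answer):

```python
import math
import re
from collections import Counter
from bs4 import BeautifulSoup

def term_vector(html, include_tags=True):
    """Term-frequency vector of the visible text, optionally including HTML tag names."""
    soup = BeautifulSoup(html, "html.parser")
    tag_names = [tag.name for tag in soup.find_all(True)] if include_tags else []
    for tag in soup(["script", "style"]):
        tag.decompose()
    terms = re.findall(r"[a-z0-9]+", soup.get_text(" ").lower()) + tag_names
    return Counter(terms)

def cosine_similarity(vec_a, vec_b):
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def probably_same_content(html_a, html_b, threshold=0.9):
    return cosine_similarity(term_vector(html_a), term_vector(html_b)) >= threshold
```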