
What does this line mean in Brin/Page's 1998 paper?


I'm not sure if this belongs in SO, but I don't know what other SE site would be more appropriate.

In Brin and Page's paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine", they describe the variable d in the PageRank algorithm as the probability that a random surfer requests a new random page. On the next line, they state:

One important variation is to only add the damping factor d to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking.

What does this mean? Why would they add the damping factor to only a single page?

Does it mean that the random surfer will keep following links until they arrive at a specific page? I don't think this makes sense because the random surfer can get stuck in a loop and never reach that specific page.


Solution

  • The Wikipedia article on PageRank kind of explains the overall concept of the damping factor, but this forum post explains it better.

    Eventually, any web surfer is going to get to a point where he/she gets bored and does something else. The theory (assuming I understood the Wikipedia article correctly) is that while looking at a given page, there's an 85% chance that a user will click on one of the links to another page. Therefore, the odds of viewing two pages in a row are 85%, the odds of viewing three pages are 72.25%, the odds of viewing four pages are 61.4%, and so on.

    Thus, if you have page A linking to page B and page B linking to page C:

    A -> B -> C

    then the popularity of page A has an 85% chance of causing page B to become popular, but only a 72.25% chance of making page C popular, because there's a 15% chance of the user randomly going to some other website instead at each of those decision points.

    Without that fall-off, I think that every website in the world would eventually end up with an infinite page rank, because the page rank would propagate through every page to every other page like a tidal wave. By damping the propagation of page rank at each step, you ensure that sites linked from high-ranking sites get a rank boost, but not sites that merely happen to be reachable via a hundred hops. (The first sketch at the end of this answer shows that fall-off on a tiny example graph.)

    The bit you're quoting is explained a bit more in their follow-on paper. There they explain that they normally use a constant damping factor when calculating page rank, and assume a 15% (1 - 0.85) probability of jumping to any arbitrary page in the world, with all possible pages equally likely to receive those jumps; but you can instead use a fixed set of target pages (or even a single web page) to receive all of those random jumps. When you do that, you end up computing a very different page rank, based on proximity to that particular page or group of pages (the second sketch at the end of this answer works through a small example).

    For example, if the user has a particular page set as his/her browser's start page, you might assume that the user will click the home button and go back to that page when he or she gets bored. Thus, pages linked closely from that page would have a higher personal page rank for that user. You can create an even better personalized ranking by adding in things like the user's bookmarks, pages they visit frequently, etc. And because rankings based on those limited sets of "restart pages" are personalized in this way, they cannot be easily manipulated by commercial interests, because nobody is likely to buy links from one of the five particular pages that happen to be in your bookmarks (or home page or whatever).
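
    To make the fall-off concrete, here is a minimal sketch of the PageRank iteration in Python, applied to the A -> B -> C chain above. It assumes a toy graph stored as a dict of out-links and ignores details the paper handles more carefully (dangling links, normalization); the function and variable names are just illustrative, not the paper's notation.

        def pagerank(graph, d=0.85, teleport=None, iterations=50):
            """graph: dict mapping each page to the list of pages it links to.
            teleport: optional set of pages that receives all of the (1 - d)
            "random jump" probability; None spreads it over every page."""
            pages = list(graph)
            n = len(pages)
            if teleport is None:
                # Uniform random jumps: every page gets an equal share.
                jump = {p: 1.0 / n for p in pages}
            else:
                # Jumps go only to the chosen page(s).
                jump = {p: (1.0 / len(teleport) if p in teleport else 0.0)
                        for p in pages}
            rank = {p: 1.0 / n for p in pages}
            for _ in range(iterations):
                new_rank = {}
                for p in pages:
                    # Rank flowing into p from pages that link to it, damped by d.
                    incoming = sum(rank[q] / len(graph[q])
                                   for q in pages if p in graph[q])
                    new_rank[p] = (1 - d) * jump[p] + d * incoming
                rank = new_rank
            return rank

        # A links to B, B links to C, C links nowhere (its rank just "leaks out"
        # in this toy version; the paper treats dangling links separately).
        chain = {"A": ["B"], "B": ["C"], "C": []}
        print(pagerank(chain))
        # A's share of rank reaches B scaled by 0.85 and reaches C scaled by
        # 0.85 * 0.85 = 0.7225, matching the percentages above.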
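
    And here is the personalization variant described in the last paragraph: the same sketch, but with every random jump sent to a single page (think of it as the user's browser home page), which is my reading of "only add the damping factor d to a single page".

        # Same toy graph, but all of the (1 - d) jump probability now lands on A.
        personalized = pagerank(chain, teleport={"A"})
        print(personalized)
        # Now A keeps the whole (1 - d) base share, B gets 0.85 of it, and C gets
        # 0.7225 of it, so pages close to A rank far higher than in the uniform
        # version, while pages (or link farms) far from A gain almost nothing.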