Search code examples
sortingsearch-engine

How does a search engine rank millions of pages within 1 second?


I understand the basics of search engine ranking, including the ideas of "reverse index", "vector space model", "cosine similarity", "PageRank", etc.

However, when a user submits a popular query term, it is very likely that millions of pages containing this term. As a result, a search engine still needs to sort these millions of pages in real time. For example, I just tried searching "Barack Obama" in Google. It shows "About 937,000,000 results (0.49 seconds)". Ranking over 900M items within 0.5 seconds? That really blows my mind!

How does a search engine sort such a large number of items within 1 second? Can anyone give me some intuitive ideas or point out references?

Thanks!

UPDATE:

  1. Most of the responses (including some older discussions) so far seem to contribute the credit to "reverse index". However, as far as I know, reverse index only helps find the "relevant pages". In other words, by inverse index Google could obtain the 900M pages containing "Barack Obama" (out of over several billions of pages). However, it is still not clear how to "rank" these millions of "relevant pages" based on the threads I read so far.
  2. MapReduce framework is unlikely to be the key component for real-time ranking. MapReduce is designed for batch tasks. When submitting a job to a MapReduce framework, the response time is usually at least a minute, which is apparently too slow to meet our request.

Solution

  • One possible strategy is just rank the top-k instead of the entire list.

    For example, to find the top 100 results from 1 millions hits, by selection algorithm the time complexity is O(n log k). Since k = 100 and n = 1,000,000, in practice we could ignore log(k).

    Now, you only need O(n) to obtain the top 100 results out of 1 million hits.