Tags: search, web, search-engine, web-crawler

Search engine components


I'm a middle school student learning computer programming, and I just have some questions about search engines like Google and Yahoo.

As far as I know, these search engines consist of:

  1. Search algorithm & code (example: a search.py file that accepts a search query from the web interface and returns the search results; see the sketch after this list)

  2. Web interface for querying and showing results

  3. Web crawler
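
To make it concrete, here is the kind of thing I imagine for component 1 (just a toy sketch; the index contents and URLs are made up):

    # Toy search.py: answer queries from a pre-built index.
    # word -> pages whose text contains that word
    INDEX = {
        "python": ["https://example.com/a", "https://example.com/b"],
        "crawler": ["https://example.com/b"],
    }

    def search(query):
        """Return pages whose indexed text contains every word in the query."""
        words = query.lower().split()
        if not words:
            return []
        results = set(INDEX.get(words[0], []))
        for word in words[1:]:
            results &= set(INDEX.get(word, []))
        return sorted(results)

    print(search("python crawler"))   # -> ['https://example.com/b']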

What I am confused about is the Web crawler part.

Do Google's and Yahoo's web crawlers search through every single webpage on the WWW at the moment I submit a query? Or do they first download all the existing webpages, save them on their huge servers, and then search through these saved copies?

If the latter is the case, then wouldn't the results appearing on the Google search results page be outdated, since I suppose going through all the webpages on the WWW would take a tremendous amount of time?

P.S. One more question: how exactly does a web crawler retrieve all the web pages that exist on the WWW? For example, does it try every possible web address, like www.a.com, www.b.com, www.c.com, and so on? (Although I know this can't be true.)

Or is there some other way to get access to all the existing webpages on the World Wide Web? (Sorry for asking such a silly question.)

Thanks!!


Solution

  • The crawlers search through pages, download them, and save (parts of) them for later processing. So yes, you are right that the results a search engine returns can easily be outdated, and a couple of years ago they really were quite outdated. Only relatively recently did Google and others start doing more real-time searching, by collaborating with large content providers (such as Twitter) to get data from them directly and frequently, though they took that real-time search offline again in July 2011. Beyond that, they track, for example, how often a web page changes, so they know which pages to crawl more often than others, and they have special systems for this, such as the Caffeine web indexing system. See also their blog post Giving you fresher, more recent search results.
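
    The "crawl more often" idea can be pictured with a small scheduling sketch (purely illustrative, not how Google actually does it; the URLs and intervals are invented): keep an estimated change interval per URL and always recrawl the page whose next due time comes soonest.

      import heapq
      import time

      def build_schedule(change_intervals):
          """change_intervals: dict of url -> estimated seconds between changes.
          Returns a min-heap of (next_due_time, url) entries."""
          now = time.time()
          heap = [(now + interval, url) for url, interval in change_intervals.items()]
          heapq.heapify(heap)
          return heap

      schedule = build_schedule({
          "https://news.example.com/": 15 * 60,        # changes roughly every 15 minutes
          "https://blog.example.com/": 24 * 60 * 60,   # changes roughly once a day
      })
      due_time, url = heapq.heappop(schedule)          # the frequently changing page is due first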

    So what happens is:

    • Crawlers retrieve pages
    • Backend servers process them (see the sketch after this list)
      • Parse text, tokenize it, index it for full text search
      • Extract links
      • Extract metadata such as schema.org for rich snippets
    • Later they do additional computation based on the extracted data, such as
      • PageRank computation
    • In parallel they can be doing lots of other stuff such as
      • Entity extraction for Knowledge graph information
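
    As a rough illustration of that backend step (very simplified; the HTML handling here is deliberately naive, where real systems use proper parsers and far more elaborate index structures):

      # Sketch of processing one fetched page: tokenize its text, add it to an
      # inverted index for full-text search, and extract its links.
      import re
      from collections import defaultdict

      inverted_index = defaultdict(set)    # word -> set of URLs containing it

      def process_page(url, html):
          text = re.sub(r"<[^>]+>", " ", html)          # strip tags (crudely)
          for token in re.findall(r"[a-z0-9]+", text.lower()):
              inverted_index[token].add(url)            # index for full-text search
          return re.findall(r'href="([^"]+)"', html)    # links to feed back to the crawler

      links = process_page("https://example.com/",
                           '<h1>Hello crawler</h1> <a href="https://example.com/next">next</a>')
      # inverted_index["crawler"] == {"https://example.com/"} and
      # links == ["https://example.com/next"]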

    Discovering what pages to crawl happens simply by starting with a page, following its links to other pages, following their links, and so on. In addition to that, they have other ways of learning about new web sites: for example, if people use their public DNS server, they learn about the pages those people visit, and the same goes for links shared on G+, Twitter, etc.
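
    A minimal sketch of that link-following discovery, using only the Python standard library (the seed URL is just a placeholder, and politeness rules, robots.txt handling, and proper error recovery are omitted, which a real crawler must not skip):

      # Start from seed URLs, fetch each page, extract its links, and queue any
      # URL we have not seen before.
      import re
      from collections import deque
      from urllib.parse import urljoin
      from urllib.request import urlopen

      def crawl(seeds, max_urls=50):
          seen, frontier = set(seeds), deque(seeds)
          while frontier and len(seen) < max_urls:     # stop once enough URLs are discovered
              url = frontier.popleft()
              try:
                  html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
              except Exception:
                  continue                             # skip pages that fail to load
              for href in re.findall(r'href="([^"]+)"', html):
                  link = urljoin(url, href)            # resolve relative links
                  if link.startswith("http") and link not in seen:
                      seen.add(link)
                      frontier.append(link)
          return seen

      # crawl(["https://example.com/"]) returns that page plus pages reachable from it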

    There is no way of knowing what all the existing web pages are. There may be some that are not linked from anywhere, that no one publicly shares a link to (and whose visitors don't use their DNS, etc.), so they have no way of knowing those pages exist. Then there is the problem of the Deep Web. Hope this helps.

    Crawling is not an easy task (for example, Yahoo is now outsourcing crawling to Microsoft's Bing). You can read more about it in Page and Brin's own paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine.
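
    That paper also introduces PageRank, the PageRank computation listed above. A toy power-iteration version of the idea, on a tiny invented link graph, looks roughly like this (dangling pages and many practical details are ignored):

      def pagerank(graph, damping=0.85, iterations=50):
          """graph: dict of page -> list of pages it links to."""
          n = len(graph)
          ranks = {page: 1.0 / n for page in graph}
          for _ in range(iterations):
              new_ranks = {page: (1 - damping) / n for page in graph}
              for page, outlinks in graph.items():
                  share = ranks[page] / len(outlinks) if outlinks else 0.0
                  for target in outlinks:
                      new_ranks[target] += damping * share   # each page passes rank along its links
              ranks = new_ranks
          return ranks

      graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
      print(pagerank(graph))   # "c", linked to by both "a" and "b", ends up ranked highest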

    You can find more details about storage, architecture, and so on, for example on the High Scalability website: http://highscalability.com/google-architecture