web-crawler bigdata cluster-computing search-engine data-storage

How to save the content of billions of websites found by a search engine (how Google does it)

In their original Google paper, Sergey Brin and Lawrence Page explained that they did not store the raw HTML of the crawled web pages directly in the repository; they compressed it first to save disk space. Here is that paragraph:

4.2.2 Repository

The repository contains the full HTML of every web page. Each page is compressed using zlib (see RFC1950). The choice of compression technique is a tradeoff between speed and compression ratio. We chose zlib's speed over a significant improvement in compression offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository as compared to zlib's 3 to 1 compression. In the repository, the documents are stored one after the other and are prefixed by docID, length, and URL as can be seen in Figure 2. The repository requires no other data structures to be used in order to access it. This helps with data consistency and makes development much easier; we can rebuild all the other data structures from only the repository and a file which lists crawler errors.
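The speed versus ratio tradeoff mentioned in that paragraph is easy to measure yourself. Below is a minimal sketch using Python's standard `zlib` and `bz2` modules; the `measure` helper and the synthetic sample page are my own assumptions for illustration, and real crawled pages will give different numbers than this repetitive input.

```python
import bz2
import time
import zlib


def measure(name, compress, data):
    """Return (name, compression ratio, seconds taken) for one compressor."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(data) / len(packed), elapsed


# Synthetic, highly repetitive HTML (an assumption, not real crawl data).
sample = (
    "<html><body>"
    + "<p>some repeated paragraph text</p>" * 2000
    + "</body></html>"
).encode("utf-8")

for name, ratio, seconds in (
    measure("zlib", zlib.compress, sample),
    measure("bz2", bz2.compress, sample),
):
    print(f"{name}: ratio {ratio:.1f}:1 in {seconds * 1000:.2f} ms")
```

Running it against a sample of your own crawled pages should show whether the extra compression is worth the extra CPU time for your workload.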

Evidently they used a compression algorithm (zlib in their case) to compress the data first and then save it in the repository. Compressed data is just binary data that can be saved directly on the file system. The metadata (page title, page size, links, etc.) could be saved in a database with a reference to the binary file on the file system. This sounds like a good idea, but for a search engine that crawls billions of pages, this way of saving data could have some drawbacks.
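To make the idea concrete, here is a rough sketch of an append-only repository where each zlib-compressed page is prefixed by docID, length, and URL, as the quoted paragraph describes. The exact header layout (`HEADER`), the function names, and the in-memory `index` dict standing in for the metadata database are all my assumptions; the paper does not specify field widths.

```python
import io
import struct
import zlib

# Hypothetical header: docID (8 bytes), URL length (2 bytes),
# compressed payload length (4 bytes), all big-endian.
HEADER = struct.Struct(">QHI")


def append_record(repo, doc_id, url, html):
    """Append one compressed page; return the byte offset where it starts."""
    offset = repo.seek(0, io.SEEK_END)
    url_bytes = url.encode("utf-8")
    payload = zlib.compress(html.encode("utf-8"))
    repo.write(HEADER.pack(doc_id, len(url_bytes), len(payload)))
    repo.write(url_bytes)
    repo.write(payload)
    return offset


def read_record(repo, offset):
    """Read and decompress the record that starts at the given offset."""
    repo.seek(offset)
    doc_id, url_len, payload_len = HEADER.unpack(repo.read(HEADER.size))
    url = repo.read(url_len).decode("utf-8")
    html = zlib.decompress(repo.read(payload_len)).decode("utf-8")
    return doc_id, url, html


def scan(repo):
    """Yield (offset, doc_id, url, html) for every record, in file order."""
    repo.seek(0)
    while True:
        offset = repo.tell()
        header = repo.read(HEADER.size)
        if len(header) < HEADER.size:
            break
        doc_id, url_len, payload_len = HEADER.unpack(header)
        url = repo.read(url_len).decode("utf-8")
        html = zlib.decompress(repo.read(payload_len)).decode("utf-8")
        yield offset, doc_id, url, html


# An in-memory buffer stands in for a real repository file.
repo = io.BytesIO()
index = {}  # stand-in for the metadata database: docID -> offset
index[1] = append_record(repo, 1, "http://example.com/", "<html>one</html>")
index[2] = append_record(repo, 2, "http://example.com/b", "<html>two</html>")
print(read_record(repo, index[2]))
```

Because every record carries its own docID, length, and URL, the `index` can be rebuilt at any time by running `scan` over the repository, which is the property the paper points out.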

What would be the best approach today? If you want to build a large-scale search engine that handles the content of millions of websites, where and how would you save the HTML content of the crawled pages?

Solution

If you want to build a large-scale search engine that handles the content of millions of websites, where and how would you save the HTML content of the crawled pages?

For the type of data you are talking about, your best bet would be to use a distributed file system. Google itself created the Google File System, a distributed, fault-tolerant file system, for this purpose.