
How does hashing of entire content of a web page work?


I have sometimes heard, especially in the context of information retrieval, search engines, crawlers, etc., that we can detect duplicate pages by hashing the content of a page. What kind of hash functions can hash an entire web page (at least two pages long), so that two copies have the same hash output value? What is the size of a typical hash output value?

Are such hash functions able to put two similar web pages, differing only by slight typos etc., in the same bucket?

Thanks,


Solution

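  • Any hash function, given two inputs x and y s.t. x = y, will by definition return the same value for them. Hash functions also accept input of any length and return a fixed-size output, so the size of the page is not an issue. But if you want to do this kind of duplicate detection properly, you will need either: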

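    • a cryptographic hash function such as SHA-256 or SHA-512 (256-bit and 512-bit outputs, respectively), which will practically never map two different pages to the same value, so you can assume an equal hash value means equal input. MD5 (128 bits) and SHA-1 (160 bits) are still fine against accidental collisions, but deliberate collisions have been demonstrated for both, so avoid them if the pages may be adversarial; or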
    • a locality sensitive hash function if you want to detect near-duplicates.

    Which one to use really depends on your needs; cryptographic hashes are useless for near-duplicate detection, since they're designed so that even a one-character change in the input produces a completely different value.
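
A minimal Python sketch of the two options, under stated assumptions: exact duplicates are fingerprinted with SHA-256 from the standard `hashlib` module, and near-duplicates with SimHash, which is only one of several locality-sensitive schemes and not necessarily the one a given search engine uses. The function names (`exact_fingerprint`, `simhash`, `hamming_distance`), the 3-word shingles and the 64-bit width are illustrative choices, not part of any library.

```python
import hashlib
import re


def exact_fingerprint(page: str) -> str:
    """SHA-256 digest of the whole page: 256 bits, shown as 64 hex characters."""
    return hashlib.sha256(page.encode("utf-8")).hexdigest()


def simhash(page: str, bits: int = 64) -> int:
    """SimHash over 3-word shingles (hypothetical parameters).

    Each shingle votes +1 or -1 on every bit position; the sign of the
    total decides the output bit, so a few changed shingles tend to flip
    only a few bits.
    """
    words = re.findall(r"\w+", page.lower())
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    counts = [0] * bits
    for shingle in shingles:
        digest = hashlib.sha256(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    original = "Hashing lets a crawler detect pages it has already seen."
    with_typo = "Hashing lets a crawler detect pagse it has already seen."

    # Exact copies share a digest; a single typo changes it entirely.
    print(exact_fingerprint(original) == exact_fingerprint(original))
    print(exact_fingerprint(original) == exact_fingerprint(with_typo))

    # For SimHash, compare fingerprints by Hamming distance instead of equality.
    print(hamming_distance(simhash(original), simhash(with_typo)))
```

With SimHash, two pages are usually treated as near-duplicates when the Hamming distance between their fingerprints falls below some threshold. Unrelated pages tend to differ in roughly half of the bits, and the gap becomes clearer on full-length pages than on one-sentence inputs like the ones above. The threshold is something you would have to tune for your own data.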