Search code examples
databasehashhashtableconsistent-hashing

Could anyone help provide a tutorial on hashing?


Recently I've read some papers on hashing techniques. It seems that hashing is everywhere.

In computer science, the hash table is commonly used as a efficient look-up data structures.

In encryption, the hashing is in the techniques such as md5 hash, sha hash, etc.

In the database area. The hashing is to build the key of the table in databases.

In machine learning, the hashing is to create short hash codes for efficient processing and economical storage, such as locality sensitive hashing, min-hash, sim-hash, hashing trick, and so on.

What are the same and different points of these applications on hashing? Could you help provide some readings or references on these hashing? Especially the differences on them. I'm confused on these hashing techniques.


Solution

  • I think the essential point of hashing is the ability to take a group of content that is variable length, dynamic in nature, and asynchronous, and be able to apply an algorithm to each member of that content that results in a "stable", fixed-sized, and essentially unique identifier for each. That is the point of most of the examples you cited:

    • Hash Tables: transform a variable length key string or structure into a "stable" unique identifier with known lower and upper bounds (aka row numbers in an array, addresses of rows in an array, row numbers in a database).
    • Cryptography: transform a variable length plain-text into a stable, unique, and fixed-length identifier.
    • Machine learning (at least the Hashing Trick): transform words (and perhaps their context) into a stable and unique key into a universal numerically organized ontology

    In all these cases you are making a small summary of the variable length content within each member of the group. Those small summaries make it much easier to deal with all the variable length content, and in the cases of hash tables can significantly speed up processing. Or especially in the case of cryptography can provide significant benefits, such as password protection (when using proper keyed and repetitive hashing) or content integrity verification.

    You will note that hashes almost always result in the potential for collisions: e.g. two completely different members of the group with different content yet the hash algorithm generates the same summary/hash value. A critical part of the design of the hash function is to determine the acceptable level of duplications allowed, and in the design of the hash implementation to properly deal with a collision when it happens. For a hash table using only a small amount of RAM the collision rate may be high. Using 256 bit crypto-hashing functions, the probability of collision may be effectively zero.

    Also, hashing is almost always "one way". Most hashing algorithms are deliberately "lossy" (which is why duplicates happen), and because of that one usually cannot reverse calculate the original variable length content from just the summary/hash value. There are brute force ways around that, but simple and fast reverse calculation is usually not possible.

    Note that we use "hashing algorithms" in our real lives as well. We use first names of coworkers in large companies as a convenience in talking/ emailing/ chatting (a trivial hash) even though there will certainly be many coworkers with the same first name. And thus collisions happen ("Do you mean Mary in Accounting or Mary in Shipping?"). You may "hash" all the known products of facial tissue into the word "Kleenex" (at least in the U.S.), yet still prefer to buy and use a different brand.