
Calculating bytes per fetch of a hash


I'm reading Chapter Four of "Beautiful Code" from O'Reilly. The author of the chapter does a back-of-the-envelope calculation that I don't follow.

The author describes building a hash with host names as keys and, for each value, a list containing a timestamp and an article name. Something like:

{
"egspd42470.ask.com" : [1166406027, "2006/05/03/MARS-T-Shirt"],
"84.7.249.205" : [1166406040, "2003/03/27/Scanner"],
}
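
For concreteness, here is a minimal Ruby sketch of how such a hash might be populated. The log format and field order are made up (the book's parsing code isn't shown here), so treat it purely as an illustration of the structure:

# Sketch only: host, epoch timestamp and article path are assumed to be
# whitespace-separated fields; the real log format may differ.
fetches = {}
ARGF.each_line do |line|
  host, timestamp, article = line.split
  fetches[host] = [timestamp.to_i, article]   # keep the latest fetch seen for each host
end
puts "#{fetches.size} distinct hosts"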

He states there are 2,345,571 host names, 12,600,064 instances of a fetch, and that "the program grew to occupy 1.56 GB of memory" while creating the hash. He then claims,

"A little calculation suggests that it costs around 680 bytes to store the information for each host, or slicing the data another way, about 126 bytes per fetch"

  1. How is "bytes per fetch" a relevant metric? Aren't hash maps static storage? That is, if the key were an 8-byte string, then wouldn't the bytes per fetch always be 8 bytes?

  2. How is he getting 680 bytes per host and 126 bytes per fetch? 1.56 GB / 2,345,571 hosts ≈ 665 bytes per host and 1.56 GB / 12,600,064 fetches ≈ 124 bytes per fetch, neither of which matches the numbers he gives. Using the exact figures provided, as well as rounded versions of them, I can reproduce neither estimate.

In case it makes a difference, the hash he's analyzing is implemented in Ruby.


Solution

  • How is "bytes per fetch" a relevant metric?

    "bytes per fetch" is indeed a very unusual metric, and on the face of it unlikely to be of any use to the programmer, but perhaps could be used for marketing (i.e. to charge for each fetch so it helps pay for the RAM costs). That said, not having read the book myself, I can't rule out that he's described a scenario where there's a fetch operation inherently coupled with adding some data to the associated value array, or some other way in which there's per-fetch memory usage.

    Aren't hash maps static storage?

    A hash table's memory usage doesn't grow just because you do a lookup, if that's what you mean by "static". (Hash tables tend to use dynamically allocated rather than static memory, but that's probably not the distinction you were after.)
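
    A quick way to see that on MRI Ruby (ObjectSpace.memsize_of reports the bytes owned by an object; it's implementation-specific, so this is just a rough check):

    require 'objspace'

    h = { "egspd42470.ask.com" => [1166406027, "2006/05/03/MARS-T-Shirt"] }
    before = ObjectSpace.memsize_of(h)
    10_000.times { h["egspd42470.ask.com"] }     # lookups only
    puts ObjectSpace.memsize_of(h) == before     # true: reads don't grow the table
    1_000.times { |i| h["host#{i}"] = [0, "x"] } # inserts
    puts ObjectSpace.memsize_of(h) > before      # true: the table grew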

    That is, if the key were an 8-byte string, then wouldn't the bytes per fetch always be 8 bytes?

    Again, "bytes per fetch" is a silly metric. Each time a fetch is done, the key being looked up has to be compared with any key or keys associated with the bucket to which the key hashed. So, say you get an 8-byte string key, then you hash it (which involves some calculation using the string content), then you find the bucket and iterate over any entries that were inserted there: for each one found you might first compare the hash values (if they've been saved in the hash table), or you might go straight to comparing the strings (which might start with comparing the string lengths, then if that's the same, scanning the string from one end while comparing with the other string). But, you might have to compare to several strings before you exhaust other strings that may have hashed to the same bucket. So, there's no exact answer for how many bytes of data have to be processed during a fetch. It will generally be of the same order of magnitude as the string length though, given the O(1) properties of hash tables.

    How is he getting 680 bytes per host and 126 bytes per fetch? 1.56 GB / 2,345,571 hosts ≈ 665 bytes per host and 1.56 GB / 12,600,064 fetches ≈ 124 bytes per fetch, neither of which matches the numbers he gives. Using the exact figures provided, as well as rounded versions of them, I can reproduce neither estimate.

    I agree: taking 1.56 GB as 1.56 × 10⁹ bytes, ~665 bytes is an upper bound on the hash table usage per hostname (the 1.56 GB covers the whole process, not just the hash), so his 680 doesn't follow from that figure. The dubious bytes-per-fetch metric should be ~123.8 bytes. Perhaps he was working from the unrounded memory figure, or rounded it to about 1.6 GB, which gives roughly 682 bytes per host and 127 bytes per fetch, close to his 680 and 126; not a big difference anyway.
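
    For reference, the arithmetic, assuming "GB" here means 10^9 bytes:

    hosts   = 2_345_571
    fetches = 12_600_064
    mem     = 1.56e9                       # bytes, assuming 1 GB = 10**9 bytes
    printf("%.1f bytes/host, %.1f bytes/fetch\n", mem / hosts, mem / fetches)
    # => 665.1 bytes/host, 123.8 bytes/fetch
    # With the memory rounded up to 1.6e9 bytes the same sums give ~682 and ~127,
    # which is close to the book's 680 and 126.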

    All that said, if you genuinely need to know the memory usage of a hash table, populate it with your own data and watch how the process's memory usage grows, as in the sketch below. The result will vary with the language, data types, text encoding, and hash table implementation, and, for compiled languages, with release vs. debug builds and optimisation levels/flags.
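
    As a starting point, here's one way to take that measurement in Ruby. It reads the process's resident set size via ps, so it assumes a Unix-like system, and the synthetic keys and values are stand-ins for real data:

    # Rough empirical measurement: populate a hash with synthetic host-style keys
    # and watch the resident set size grow. Assumes `ps -o rss=` reports kilobytes.
    def rss_bytes
      `ps -o rss= -p #{Process.pid}`.to_i * 1024
    end

    n = 1_000_000
    before = rss_bytes
    table = {}
    n.times do |i|
      table["host#{i}.example.com"] = [1_166_406_027 + i, "2006/05/03/article-#{i}"]
    end
    puts "~#{(rss_bytes - before) / n} bytes per entry (very rough)"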