Search code examples
common-lisphashtablesbcl

Is an EQ Hash-table Really More Efficient than an EQUAL Hash-table in SBCL?


I've always assumed that EQ is much faster than EQUAL for hash-tables. But a simple test gives contrary results. Any insights appreciated.

(defun random-string (n)
  "Generate a random string of length n."
  (let ((charset "ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
    (iter (repeat n)
          (collect (char charset (random (length charset)))
                   result-type string))))

Test first for EQUAL hash-table:

* (defparameter random-strings (iter (for i from 1 to 5000)
                                     (collect (random-string 5))))
RANDOM-STRINGS

* (defparameter ht (make-hash-table :test #'equal :size 10000))
HT

* (dolist (rs random-strings)
    (setf (gethash rs ht) t))
NIL

* (time (dotimes (i 1000000)
          (dolist (rs random-strings)
            (gethash rs ht))))

Evaluation took:
  14.420 seconds of real time
  8.703125 seconds of total run time (8.687500 user, 0.015625 system)
  60.35% CPU
  51,914,146,826 processor cycles
  0 bytes consed

Test next for EQ hash-table:

* (defparameter random-strings (iter (for i from 1 to 5000)
                                     (collect (intern (random-string 5)))))
RANDOM-STRINGS

* (defparameter ht (make-hash-table :test #'eq :size 10000))
HT

* (dolist (rs random-strings)
    (setf (gethash rs ht) t))
NIL

* (time (dotimes (i 1000000)
          (dolist (rs random-strings)
            (gethash rs ht))))

Evaluation took:
  15.309 seconds of real time
  9.500000 seconds of total run time (9.484375 user, 0.015625 system)
  62.06% CPU
  55,112,812,169 processor cycles
  0 bytes consed

Solution

  • [This answer is really an extended comment. I also have no recent experience of implementing these things and would welcome correction from people who do.]

    The answer is that it's complicated. It's complicated for at least two reasons:

    for an eq hashtable you might think you can get the hash 'for free' by computing it from the address of the object (or the object itself if it is immediate, like a fixnum). Except ... garbage collectors relocate objects all the time, so now do you have to recompute the hashes of the objects every time the garbage collector runs (which could be hundreds of times a second)? I don't know what people do about this, but I imagine it's a mass of complicated tradeoffs. The main point is that eq hashtables are not as simple as they look in the presence of copying garbage collectors, and that non-simplicity probably has costs.

    For an equal hashtable you probably need to compute a more complicated hash unless you want very many collisions. But computing the hash of a vector of immediate objects is some kind of best-case for modern machines, which absolutely love going linearly through memory and doing something to each thing they find. For short strings this is probably blindingly fast.

    So computing the hash of short strings may be extremely fast.

    And finally in both cases once you've computed the hash of the thing you look it up in whatever structure underlies the hashtable and you then need to check whether the thing that's there (if anything is there) is the same as the thing you've computed the hash of.

    For an eq hashtable that 'is it the same' check is pretty fast.

    But for an equal hashtable that check only may be slow. In particular it's pretty obvious that any kind of call to (equal a b) starts with (or (eq a b) ...). So, in particular if you have a case where your objects are either eq to each other or they are not equal then equal can succeed very quickly indeed (if it is going to fail this may be slower because then it has to do work of course).

    And this is just what you are doing: two random strings are very unlikely to be equal unless they are eq, and your code:

    1. fills a hashtable with strings;
    2. checks whether the same (eq) strings are there or not.

    But, well, they all are there, so probably in almost all cases you are hitting the good case of equal (the bad cases where would be two strings hash to the same bucket and at least one comparison then fails, but good hashtable designs will make this rare).