
Determining if a bit array is in a collection of bit arrays


Given 2-dimensional NxN bit arrays, I'm trying to evaluate the best way to determine if a bit array is already in a large collection of previously seen bit arrays.

A straightforward approach would put the bit arrays in a hash table. But comparing arrays as keys requires :test #'equalp, which may not be very efficient. (Then again, maybe SBCL automatically optimizes for different key types?)

Another plan is to convert all bit arrays to integers, and put the integers in a hash table. Then the test could be #'eql:

(defun bit-arr-to-int (bit-array)
  "Read the bits of BIT-ARRAY, in row-major order, as a single integer."
  (reduce (lambda (acc bit)          ; ACC is the integer accumulated so far
            (+ (* acc 2) bit))
          (make-array (array-total-size bit-array)
                      :element-type 'bit   ; must match the displaced-to array
                      :displaced-to bit-array)
          :initial-value 0))               ; also covers a zero-size array
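
For example, the resulting integers can then serve directly as eql hash keys. A self-contained sketch (bit-arr-to-int* is a hypothetical name used here to avoid clashing with the version above; note that the displaced array needs :element-type 'bit to legally displace onto a bit array):

(defun bit-arr-to-int* (bit-array)
  "Row-major bits of BIT-ARRAY read as one integer (self-contained sketch)."
  (reduce (lambda (acc bit) (+ (* acc 2) bit))
          (make-array (array-total-size bit-array)
                      :element-type 'bit
                      :displaced-to bit-array)
          :initial-value 0))

(let ((seen (make-hash-table :test #'eql))   ; EQL suffices for integer keys
      (arr (make-array '(2 2) :element-type 'bit
                       :initial-contents '((1 0) (1 1)))))
  (setf (gethash (bit-arr-to-int* arr) seen) t)
  (gethash (bit-arr-to-int* arr) seen))      ; => T

The bits 1,0,1,1 of the 2x2 array encode the integer 11, so the second lookup finds the key stored by the first.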

Not sure, though, if this would end up being more efficient than letting SBCL handle things, since it still individually processes every element. Maybe a customized hash table would offer an efficiency advantage?

A third option might be to change the basic representation from a 2-dimensional bit array to a simple bit vector (or a single integer), since the dimensions of the original bit array are known. To allow array-equivalent element references, this would require a function that translates an implicit (row, col) array coordinate into an explicit bit-vector index. It may be more efficient to compute the indexes as needed than to convert a whole bit array to an integer for each hash table lookup, as above.
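
A minimal sketch of that coordinate translation (row-major layout assumed; rc-index and bv2-ref are hypothetical names, not part of any library):

(defun rc-index (row col columns)
  "Translate an implicit (ROW, COL) coordinate to a flat row-major index."
  (+ (* row columns) col))

(defun bv2-ref (bit-vector row col columns)
  "Reference BIT-VECTOR as if it were a bit array with COLUMNS columns."
  (sbit bit-vector (rc-index row col columns)))

For a 2x3 layout, (bv2-ref #*000100 1 0 3) indexes bit 3 and returns 1.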

I'd appreciate some experienced insights.


Solution

  • I think the only way to know this is to try things and see what is fastest.

    One horrible trick you can try is to represent bit-vectors as a cons of (row-length . integer-of-bits). You need the row length so you can compute the offset in the integer of bits. Then you can do something like this (these may be wrong because I'm always confused by dpb and ldb):

    (deftype index ()
      `(integer 0 ,most-positive-fixnum))
    
    (defun make-bv (columns)
      (declare (type index columns))
      (cons columns 0))
    
    (defun bv-ref (bv row column)
      (declare (type (cons index integer) bv)
               (type index row column))
      (let ((columns (car bv)))
        (assert (< column columns) (column) "column out of range")
        (ldb (byte 1 (+ (* row columns) column)) (cdr bv))))
    
    (defun (setf bv-ref) (bit bv row column)
      (declare (type bit bit)
               (type (cons index integer) bv)
               (type index row column))
      (let ((columns (car bv)))
        (assert (< column columns) (column) "column out of range")
        (setf (cdr bv) (dpb bit (byte 1 (+ (* row columns) column)) (cdr bv)))
        bit))
    

    In real life you want to inline these things of course.

    A representation like this is possibly good for hashing: you can just hash on equal, or even keep nested eql hashtables for the two elements of the cons. The notable problem is that it doesn't record how many rows the object has: they all essentially have an unbounded number of rows.
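
    For instance, an equal hash table compares these conses component-wise, so structurally identical bvs collide as desired (a sketch; the literal (3 . 11) stands in for a 3-column bv whose bits encode the integer 11):

    (let ((seen (make-hash-table :test #'equal))
          (bv1 (cons 3 11))    ; 3 columns, bits #b1011
          (bv2 (cons 3 11)))   ; structurally equal but a distinct cons
      (setf (gethash bv1 seen) t)
      (gethash bv2 seen))      ; => T even though BV1 and BV2 are not EQL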

    It's awful if the 'arrays' are significantly large and you change bits in them a lot, because each bit change conses a new bignum.

    But I think the only way to know is to measure.