Search code examples
packagecommon-lisphashtablesbcl

Using a Package as a Hash Table in Common Lisp


Would it be practicable to initially store a lot of symbols in a package (separate from a project package), and effectively use it as a hash table (where the keys indicate simple set membership data) accessible by the project?

* (defpackage :project (:use :cl))
#<PACKAGE "PROJECT">

* (defpackage :temp (:use :cl))
#<PACKAGE "TEMP">

* (in-package :project)
#<PACKAGE "PROJECT">

* (intern "ABC" :temp)
TEMP::ABC
NIL

* (find-symbol "ABC" :temp)
TEMP::ABC
:INTERNAL

;The following is not needed for this project, but is available

* (setf (symbol-value (find-symbol "ABC" :temp)) 123)
123

* (symbol-value (find-symbol "ABC" :temp))
123

This seems to work, but are there good reasons to avoid and just use a hash table instead? My first thought was to avoid cluttering the project package with a lot of misc key symbols. These symbols would be generated from strings at runtime (to check set membership in the hash table). This would seem to involve runtime interning of many symbols that may or may not be stored in the table. But I'm also wondering if find-symbol could be more efficient than gethash. Alternately, I could just use an equal hash table for the strings, but the sheer number of lookups seems to point to eq rather than equal. Looking for any informed advice, thanks.

Edit: Simple tests comparing equal hash-table with package symbol lookups.

First, hash-table test

* (defun random-string (n)
  "Generate a random string of length n."
  (let ((charset "ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
    (iter (repeat n)
          (collect (char charset (random (length charset)))
                   result-type string))))
RANDOM-STRING

* (defparameter *ht* (make-hash-table :test #'equal :size 10000))
*HT*

* (iter (for i from 0 to 5000)
      (setf (gethash (random-string 5) *ht*)
            t))
NIL

* (time (dotimes (i 1000000)
        (gethash (random-string 5) *ht*)))

Evaluation took:
  0.150 seconds of real time
  0.109375 seconds of total run time (0.109375 user, 0.000000 system)
  72.67% CPU
  541,959,942 processor cycles
  127,937,760 bytes consed

Next package lookup

* (defpackage :temp1)
#<PACKAGE "TEMP1">

* (iter (for i from 0 to 5000)
      (intern (random-string 5) :temp1))
NIL

* (time (dotimes (i 1000000)
        (find-symbol (random-string 5) :temp1)))

Evaluation took:
  0.224 seconds of real time
  0.171875 seconds of total run time (0.156250 user, 0.015625 system)
  [ Run times consist of 0.015 seconds GC time, and 0.157 seconds non-GC time. ]
  76.79% CPU
  807,944,162 processor cycles
  127,954,624 bytes consed

Hash-table is about 1.5 times faster


Solution

  • You can do this but it is both stylistically awful and almost certainly slow.

    In particular please understand that if you have a string which has been created at runtime and you want to map it to a unique object in such a way that all strings which contain the same sequence of characters map to the same object you must consider each element of the string: in other words you must do what equal does for strings. There is no secret magic trick which allows you to use eq on strings.

    So to do what you want to do, which is informally 'interning' strings, you can compare the options you are considering:

    Hash tables are first-class objects, and do exactly and only what you want, which is what they are designed to do.

    Packages are not completely first-class: you will have to globally manage the package(s) you use. Packages almost certainly use a hash-table or some equivalent structure internally. Packages map strings to symbols which are predefined objects which are probably rather heavyweight. Packages are not designed to do what you want although can be abused to do so.

    In practice what this means is that packages will almost certainly be much slower, allocate much more memory, and your code will be ugly as shit.

    I did some brief tests, comparing creating a large number of symbols in a fresh package (ie intern a large number of unique strings) compared with adding those same strings to an equal hashtable. This is slower by a factor of about 10 in LispWorks and about 6 in SBCL.


    One possible counterexample to this is where you have a scenario like this:

    • I have a large number of collections of objects which contain something like strings.
    • for a given string I wish to now check it against this large number of objects, as quickly as possible.

    So what I'd now like to do is to store, in these many collections, objects which represent strings in some unique way, which will save a large number of string comparisons.

    The traditional approach to that would be to use symbols, either by a package or (probably better, today) by a hashtable:

    (defvar *string-table*
      ;; Note we're gong to case-fold here.
      (make-hash-table :test #'equalp))
    
    (defun canonicalize-string (s &key (table *string-table*) (upper-case t))
      (or (gethash s table)
          (setf (gethash s table) (make-symbol (if upper-case (string-upcase s) s)))))
    

    Now you need to call canonicalize-string (which is like find-symbol / intern) whenever you want the symbol for a string, and you can then write, for instance:

    (let ((s (canonicalize-string ...)))
      (dolist (set my-large-collection-of-sets nil)
        (when (present-in-set-p s set)
          (return set))))
    

    Except, wait. If you're only using the symbols as unique avatars for strings, why use symbols?

    (defun canonicalize-string (s &key (table *string-table*) (upper-case t))
      (or (gethash s table)
          (setf (gethash s table) (if upper-case (string-upcase s) s))))
    

    Now this code will allocate less but still do exactly the same thing as the original code did.


    As a final note, it is very easy to define lightweight interned-strings, complete with a literal syntax for them:

    (in-package :cl-user)
    
    (defvar *interned-strings-table* nil)
    
    (defstruct interned-strings-table
      (parent *interned-strings-table*)
      (table (make-hash-table :test #'equal)))
    
    (setf *interned-strings-table*
      (make-interned-strings-table :parent nil))
    
    (defun intern-string (s &optional (table *interned-strings-table*))
      (do ((current table (interned-strings-table-parent current)))
          ((not current)
           (setf (gethash s (interned-strings-table-table table)) s))
        (let ((it (gethash s (interned-strings-table-table current))))
          (when it (return it)))))
    
    (defun clear-interned-strings (&optional (table *interned-strings-table*))
      (do ((current table (interned-strings-table-parent current)))
          ((not current) t)
        (clrhash (interned-strings-table-table current))))
    
    (defun read-interned-string (reader stream char)
      (let ((s (funcall reader stream char)))
        (typecase s
          (string
           (intern-string s))
          (t
           (error "oops")))))
    
    (defun make-string-interning-readtable (&optional (from *readtable*) (to nil))
      (let ((strt (copy-readtable from to)))
        (when (get-dispatch-macro-character #\# #\" strt)
          (error "Someone is already using #\""))
        (set-dispatch-macro-character
         #\# #\"
         (let ((string-reader (get-macro-character #\" strt)))
           (lambda (stream char prefix)
             (declare (ignore prefix))
             (read-interned-string string-reader stream char)))
         strt)
        strt))
    

    Now with *readtable* set to the result of (make-string-interning-readtable):

    > (eq #"foo" #"foo")
    t
    
    > (eq #"foo" (intern-string "foo"))
    t
    
    > (eq #"foo" "foo")
    nil                                     ;in fact maybe
    
    > (eq "foo" "foo")
    nil                                     ;in fact maybe