Search code examples
multithreadingcommon-lisphashtablesbcl

Accessing a Global Hashtable from Concurrent Threads (Common Lisp)


I cannot find the bug related to the global hash-table in the following SBCL code:

(defparameter *lci-hash-table* (make-hash-table :size 10000))  
  "Hash-table key = length|char|index concatenated together (eg, 5A0),
   hash-table values = list of matching symbols (eg, (ABATE ABBEY ...))."

(defun dictionary-compatible ($new-cross-str)  ;eg, "A     " of length 5
  "Tests if a string (with uppercase alphabetic and space characters)
   is compatible with the dictionary words (symbols) in a hash-table."
  (iter (with len = (length $new-cross-str))
        (for char in-sequence $new-cross-str)
        (for index from 0)
        (when (not (eql char #\Space))
          (collect (gethash (ut::intern-symbol len char index)  ;this gethash
                            *lci-hash-table*) into dict-words)) ;always returns nil
        (finally (return (reduce #'intersection dict-words)))))

(defun intern-symbol (&rest args)
  "Interns a symbol created by concatenating args.
   Based on symb in Let Over Lambda."
  (flet ((mkstr (&rest args)
           (with-output-to-string (s)
             (dolist (a args) (princ a s)))))
    (values (intern (apply #'mkstr args)))))

The function dictionary-compatible above runs fine in single-threaded mode in SBCL. But starting over with a new image in multi-threaded mode (using the :lparallel library) the gethash access to the global *lci-hash-table* always returns NIL.

The *lci-hash-table* is only being accessed, never updated, and therefore, according to the SBCL manual, there is no need to make it :synchronized, or defined with sb-ext:defglobal.

I think I'm missing something that's related to multi-processing for hash-tables in SBCL.

Edit: For background info, here is the code to populate the hash-table:

(defun create-lci-ht (dictionary-file)
  "Read dictionary words (symbols) from a file into strings
   and store in a lci hash table."
  (with-open-file (infile dictionary-file :direction :input :if-does-not-exist nil)
    (when (not (streamp infile)) (error "File does not exist!"))
    (with-open-file (stream dictionary-file)
      (let ((word-strings (uiop:read-file-lines stream)))
        (iter (for word-string in word-strings)
              (for word-length = (length word-string))
              (while word-string)
              (iter (for char in-sequence word-string)
                    (for index from 0)
                    (push (intern word-string)
                          (gethash (ut::intern-symbol word-length char index)
                                   *lci-hash-table*))))))))

Solution

  • This code exhibits two classic mistakes in CL: mistakes which everyone, including me, has made at some point!

    The first mistake is that you are explicitly creating and looking up symbols in a package without ever specifying the package, so you are simply relying on whatever the ambient value of *package* is when the code is running. That's, at the very least, fragile.

    • (in-package ...) in a source file sets the package during the time the code in the file is being compiled and loaded but not the ambient package outside the dynamic extent of the compilation / loading, since both load and compile-file rebind *package* to its ambient value.
    • Macros like with-standard-io-syntax rebind *package*.
    • The behaviour of *package*, and special variables in general, with multithreading is implementation-dependent and shims like Bordeaux Threads do not (and probably can not!) hide this.

    As an example of the last point, consider this code:

    ;;; Assume Bordeaux threads is loaded
    ;;;
    
    (defpackage :one
      (:use :cl :bordeaux-threads))
    
    (defpackage :two
      (:use :cl :bordeaux-threads))
    
    (in-package :one)
    
    (start-multiprocessing)
    
    (defun test (p1 p2)
      (setf p1 (find-package p1)
            p2 (find-package p2))
      (let ((*package* p1))
        (multiple-value-bind (tp1 tp2)
            (join-thread
             (make-thread
              (lambda ()
                (values *package*
                        (setf *package* p2)))))
          (values (package-name tp1) (package-name tp2)
                  (package-name *package*)))))
    

    Now in SBCL:

    > (test :one :two)
    "COMMON-LISP-USER"
    "TWO"
    "ONE"
    > (test :one :two)
    "TWO"
    "TWO"
    "ONE"
    

    Yes: different results!

    In LW:

    > (test :one :two)
    "ONE"
    "TWO"
    "ONE"
    > (test :one :two)
    "ONE"
    "TWO"
    "ONE"
    

    I don't understand SBCL's behaviour but I also have not read a lot of SBCL's manual in detail. I think what is probably happening is that in SBCL threads do not inherit special bindings, and *package* is bound somewhere in the REPL thread: this means that the assignment in the thread is to the global value of *package*. In LW threads do inherit their parent's bindings I think.

    The way of resolving this is to always be specific about what package you are using if you are calling functions like intern. This is what is actually breaking your code.

    The second mistake is that you are using packages to map strings to symbols and then using a hashtable to map symbols to values. In other words:

    1. you are looking up a string in a package, which is an object whose job is to map strings to symbols, which are objects which, among other things, have values and properties;
    2. you are then looking up the symbol you found in a hashtable, which is an object whose job is to map objects to values.

    OK, think about that: you're doing two lookups to get the value you're after. Instead, do one:

    • either look up the string in a package and then simply use the value of the resulting symbol;
    • or, and almost certainly better, look up the string in an equal hashtable directly.

    The first approach is essentially just using a package as a hashtable: this is fine so long as you are careful that this is all you use it for (so make sure you don't use it for the symbols which make up your program as well!), but it's ugly for two reasons:

    • packages are not completely first-class in CL, and in particular you can't have anonymous packages, so you end up having to be careful about names.
    • symbols tend to be rather heavyweight objects in CL implementations, with a bunch of slots of which you are using only one -- they don't have to be but they often are.

    So generally, if what you want is a map from strings to values, where those strings are not the names of parts of your program, the best approach is just to use an equal hashtable for this.


    If you really do need to use a package for some reason, and you want intern-symbol to take an unknown number of arguments, then probably you want intern-symbol either to take the package as its first argument:

    (defun intern-symbol (package &rest args)
      (intern (format nil "~{~A~}" args) package))
    

    Or you might provide an explicit variable

    (defvar *my-hashtable-package* (make-package "MY-HASHTABLE-PACKAGE" :use '()))
    
    (defun intern-symbol (&rest args)
      (intern (format nil "~{~A~}" args) *my-hashtable-package*))
    

    As a final note: intern-symbol is not reliable in general:

    > (eq (intern-symbol 1 23) (intern-symbol 1 2 3))
    t
    
    > (eq (intern-symbol 1 23) (intern-symbol 1 "2" 3))
    t
    

    Of course it may be fine in constrained cases.