Search code examples
collectionssmalltalkgnu-smalltalk

Why adding a collection to itself blows up in Smalltalk?


I wonder why this is not terminating in GNU Smalltalk:

s := Set new. s add: s

In theory, s should be simply a set containing an empty set. But executing that just loops forever, blowing up the heap.

Interestingly, ((s := Set with: 4 with: 5 with: 6) add: s) size. terminates and evaluates to 4.


Solution

  • An introduction

    A Set is a kind of HashedCollection specifically designed for fast membership check. Internally a Set has a HashTable, a sparce array with many empty slots to minimize the number of collisions. When we #add: an element to the Set, an index to the HashTable is calculated as (hash \\ size) + 1, where #\\ is the mod operation and size is the length of the table. If the slot at index is already occupied, the index is incremented until a free slot is found, and the element is stored there (N.B., + 1 because Smalltalk arrays are 1-based.) [cf. How does Set actually work]

    Our case

    Now, let's see what happens with the code of the question:

    1. s := Set new.
    2. s add: s.
    

    As described above, in step 2, add: s will compute:

    s hash \\ p + 1
    

    where p is the initial number of slots of s's inner table (a prime number, set to 5 or 7 when the set is first created, and increased later as needed.)

    So far, so good. But there might be some

    Issues

    Where? Depending on the Smalltalk dialect, there might be an issue with #printOn: or with the next add:ition of an element.

    The printing issue

    A problem that might happen with #printOn: is infinite recursion. When printing s, one would want to print its elements as well, and in our case such an approach would recursively try to print s in the process, creating an endless circularity.

    To prevent this from happening, Pharo uses a LimitedWriteStream that will stop writing after a certain number of iterations, breaking the recursion, should it exist.

    I haven't checked it myself, but this is the issue that seems to be happening in GNU Smalltalk (according to the question.)

    Note that it is not enough to print only a max number of elements in the Set. In fact, our set only contains one element (itself), which is enough to create the recursion.

    The addition issue

    As observed by @aka.nice in his comment, care must also be taken when adding s a second time. Why? Because, as we have seen in the introduction above, the message add: s will need to compute s hash …. And how is s hash defined? This is an interesting question. Smalltalkers are routinely confronted with the problem of implementing a good #hash in some classes. Since s is a collection, it is tempting to take the hash of its elements into account for the final result, right? Pharo takes this approach, look:

    hash
      | hash |
      hash := self species hash.
      self size <= 10 ifTrue:
        [self do: [:elem | hash := hash bitXor: elem hash]].
      ^hash bitXor: self size hash
    

    which swallows the bait. Why? Because the set s has 1 element (itself), so the condition self size <= 10 is true, and the method will try to recursively compute s hash again, leading to, oh oh, Stack Overflow.

    Conclusions

    1. Be careful when implementing Collection >> #printOn:. Even if the collection doesn't contain itself, a recursion might arise if there is an indirect reference from one of the elements back to the collection containing it.

    2. Be careful when implementing Collection >> #hash (same reasons)

    3. Be careful when adding a Collection to itself. More generally, be careful when a collection contains an element with a (possibly indirect) reference to it, because enumerating such a collection may be tricky.

    4. In Mathematics (the Science), a Set cannot be an element of itself (this is explicitly forbidden by the Axioms of Set Theory). So, re-think your model before violating this rule that comes from an extremely mature and evolved scientific body of knowledge.