This is more a gotcha I wanted to share than a question: when printing with toString()
, Java will detect direct cycles in a Collection (where the Collection refers to itself), but not indirect cycles (where a Collection refers to another Collection which refers to the first one - or with more steps).
import java.util.*;
public class ShonkyCycle {
static public void main(String[] args) {
List a = new LinkedList();
a.add(a); // direct cycle
System.out.println(a); // works: [(this Collection)]
List b = new LinkedList();
a.add(b);
b.add(a); // indirect cycle
System.out.println(a); // shonky: causes infinite loop!
}
}
This was a real gotcha for me, because it occurred in debugging code to print out the Collection (I was surprised when it caught a direct cycle, so I assumed incorrectly that they had implemented the check in general). There is a question: why?
The explanation I can think of is that it is very inexpensive to check for a collection that refers to itself, as you only need to store the collection (which you have already), but for longer cycles, you need to store all the collections you encounter, starting from the root. Additionally, you might not be able to tell for sure what the root is, and so you'd have to store every collection in the system - which you do anyway - but you'd also have to do a hash lookup on every collection element. It's very expensive for the relatively rare case of cycles (in most programming). (I think) the only reason it checks for direct cycles is because it so cheap (one reference comparison).
OK... I've kinda answered my own question - but have I missed anything important? Anyone want to add anything?
Clarification: I now realize the problem I saw is specific to printing a Collection (i.e. the toString()
method). There's no problem with cycles per se (I use them myself and need to have them); the problem is that Java can't print them. Edit Andrzej Doyle points out it's not just collections, but any object whose toString
is called.
Given that it's constrained to this method, here's an algorithm to check for it:
toString()
is invoked on (to determine this, you need to maintain state on whether a toString is currently in progress or not; so this is inconvenient).
This approach also correctly renders multirefs (a node that is referred to more than once).
The memory cost is the IdentityHashMap (one reference and index per object); the complexity cost is a hash lookup for every node in the directed graph (i.e. each object that is printed).
I think fundamentally it's because while the language tries to stop you from shooting yourself in the foot, it shouldn't really do so in a way that's expensive. So while it's almost free to compare object pointers (e.g. does obj == this
) anything beyond that involves invoking methods on the object you're passing in.
And at this point the library code doesn't know anything about the objects you're passing in. For one, the generics implementation doesn't know if they're instances of Collection
(or Iterable
) themselves, and while it could find this out via instanceof
, who's to say whether it's a "collection-like" object that isn't actually a collection, but still contains a deferred circular reference? Secondly, even if it is a collection there's no telling what it's actual implementation and thus behaviour is like. Theoretically one could have a collection containing all the Longs which is going to be used lazily; but since the library doesn't know this it would be hideously expensive to iterate over every entry. Or in fact one could even design a collection with an Iterator that never terminated (though this would be difficult to use in practice because so many constructs/library classes assume that hasNext
will eventually return false
).
So it basically comes down to an unknown, possibly infinite cost in order to stop you from doing something that might not actually be an issue anyway.