Search code examples
javaperformancesearchstring-literalsstring-interning

Search cost of string interning and declaration of literal strings


Two Questions.

  1. When we declare literal strings, we search whether there is the same string in string pool of heap. Is this also an interning (method intern of class String)?

  2. In my thought, each literal string declaration needs a binary search or something so it costs at least log(n) when n is number of existing strings in the pool. And if there are many strings in the pool, it may be high cost. (maybe tradeoff of searching cost and memory?) On this point of view, it might be dangerous to declare mant literal strings. How significant is this searching cost and why java is designed in this way(searching pool when literal strings are declared).

Following is what I referred to understand background.


The JavaDoc for the java.lang.String class states:

Strings are constant; their values cannot be changed after they are created. String buffers support mutable strings. Because String objects are immutable they can be shared.

http://www.janeg.ca/scjp/lang/strLiteral.html comments:

In other words, because the compiler knows the strings original value cannot be changed once it's created it can safely use existing data and avoid cluttering up memory with duplicates.


Solution

  • You're confusing compile time complexity with runtime complexity.

    When the class is loaded, yes it does a search to see if each literal already exists (although I imagine it would use a hashmap for O(1) lookup instead of your proposal).

    When the code runs, it has the reference to the string in memory so there is no additional cost than a non-literal.

    So yes, literals are interned. According to the Javadoc for String,

    A pool of strings, initially empty, is maintained privately by the class String.

    You can invoke intern() on a String to add it to this pool. It logically follows that if a.equals(b) then a.intern() == b.intern(), since .intern() guarantees to return from the a unique pool.

    Example:

    class InternTest {
        // assuming InternTest is the only class, internPool.size = 0
        String x = "ABC"; // interned at class load, internPool.size = 1
        String y = "DEF"; // interned at class load, internPool.size = 2
        String z = "ABC"; // interned at class load, but match found - size = 2 still
    
        void foo() {
            // random int is just a mechanism to get something that I know won't
            // be interned at loadtime - could have loaded from file or database too
            int i = (new java.util.Random()).nextInt(1000) + 100;
            int j = i;
            String s = String.valueOf(i); // not yet interned, size = 2 still
            String t = String.valueOf(j); // not yet interned, size = 2 still
    
            String sIntern = s.intern(); // manually interned, size = 3 now
            String tIntern = t.intern(); // manually interned, match found, size = 3 still
    
            System.out.println("equals: " + (s.equals(t))); // should be true
            System.out.println("== raw: " + (s == t)); // should be false, different variables
            System.out.println("== int: " + (sIntern == tIntern)); // should be true, from unique pool
    
           System.out.println("x and z: " + (x == z)); // should be true, interned at class load
        }
    
        public static void main(String[] args) {
            (new InternTest()).foo();
        }
    
    }
    

    Results when run:

    C:\Documents and Settings\glowcoder\My Documents>java InternTest
    equals: true
    == raw: false
    == int: true
    x and z: true
    

    I should point out that the assumption will never be true. The Java language itself has many Strings that would be interned before our Strings ever got the chance to see the light of day. However, assuming everything is loaded sequentially, if you consider only the delta of Strings being interned, and assume no collisions with existing interns (we all know interns can be fussy and full of drama, right? snicker) then the numbers do indeed indicate the delta of the size of the string pool.