c# .net string stack heap-memory

String.Intern result inconsistency and possible use cases


        string s1 = new string (new char[] { 'H', 'e', 'l', 'l', 'o' });
        string i1 = string.Intern(s1);
        bool result1 = ReferenceEquals(s1, i1);

        string s2 = new string(new char[] { 'H', 'e', 'l', 'l', 'o' });
        string i2 = string.Intern(s2);
        bool result2 = ReferenceEquals(s2, i2);

        bool result3 = ReferenceEquals(i1, i2);
        Console.WriteLine(result1); // Should be False
        Console.WriteLine(result2); // Should be False
        Console.WriteLine(result3); // Should be True

The output of this little test is True, False, True on my PC and in one online compiler, but False, False, True in another online compiler. I asked GPT about this, and it claimed that the second result should be True as well.

If I understood correctly, s1 creates a new string object on the heap (which is automatically registered in the intern pool) and holds the reference to that string object. i1 checks the VALUE of s1 to see if it exists in the intern pool, and since it does, i1 takes the reference from the intern pool, making s1 and i1 reference the same string object.

Similarly, s2 creates another string object with the same value as s1 on the heap (although I'm not sure if it is immediately added to the intern pool). i2 checks the VALUE of s2 in the intern pool, finds that this string object has already been created, and thus takes the reference from the intern pool. As a result, i1 and i2 reference the same object, so they return true.

I'd be glad if someone could explain how interning works and why the results differ. Also, what would be the use case for this function in a team environment?

PS: I'm running it on .NET 7 locally and on .NET 8 in the online compilers.


Solution

  • Disclaimer: I have only tested this on .NET 8 in Debug mode. The output I get is True, False, True, which can be explained by tracing through the code in WinDbg.

    The behavior is as follows: creating a string from a char[] does not automatically intern the string. Only when you call Intern is the reference added to the intern table. When you call Intern, the table is searched for a string with the same value; if a match is found, that reference is returned. Otherwise, the reference you passed in is added to the table and then returned.
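
    The lookup-or-add behavior described above can be reproduced without a debugger. The following is a minimal sketch, not the code from the question: the class name InternDemo, the helper Fresh, and the random suffix are assumptions I introduced so that no equal string is already in the pool when the test starts. string.IsInterned is used to check the pool without modifying it.

```csharp
using System;

public static class InternDemo
{
    // Hypothetical helper: copies the characters into a brand-new heap string,
    // just like new string(char[]) in the question.
    private static string Fresh(string value) => new string(value.ToCharArray());

    public static (bool PooledBefore, bool FirstSame, bool SecondSame, bool InternedSame) Run()
    {
        // Assumption: the random suffix makes it practically certain
        // that no equal string has been interned before this point.
        string value = "Hello-" + Guid.NewGuid().ToString("N");

        string s1 = Fresh(value);
        // IsInterned only looks the value up; it never adds to the pool.
        bool pooledBefore = string.IsInterned(s1) != null;   // false: not interned yet

        // No match in the pool, so s1 itself is added and returned.
        string i1 = string.Intern(s1);
        bool firstSame = ReferenceEquals(s1, i1);             // true

        // A second, distinct object with the same value.
        string s2 = Fresh(value);
        // A match exists now, so the pooled reference (s1) is returned instead of s2.
        string i2 = string.Intern(s2);
        bool secondSame = ReferenceEquals(s2, i2);            // false
        bool internedSame = ReferenceEquals(i1, i2);          // true

        return (pooledBefore, firstSame, secondSame, internedSame);
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

    The last three values correspond to result1, result2 and result3 in the question, under the assumption that the pool did not contain the value beforehand.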

    Initially, we create a string from char[]. This string is not interned by the runtime as it isn't considered a string literal. By listing locals I get the following:

      LOCALS:
        0x000000B35897E858 = 0x000002a4c0440708 <- this is the ref for our string
    

    Running !gcroot shows that the string is rooted only by references on the stack, i.e. it isn't interned.

    0:000> !gcroot 000002a4c0440708
    Caching GC roots, this may take a while.
    Subsequent runs of this command will be faster.
    
    Thread 49b4:
        rbp+50: 000000b35897e810
          -> 02a4c0440708     System.String 
    
        rbp+98: 000000b35897e858
          -> 02a4c0440708     System.String 
    

    Next, we intern this string, which returns the reference to the interned string. The debugger shows that both locals point to the same address:

       LOCALS:
        0x000000B35897E858 = 0x000002a4c0440708
        0x000000B35897E850 = 0x000002a4c0440708
    

    gcroot now lists an additional root.

    0:000> !gcroot 000002a4c0440708
    
    HandleTable:
        000002a4bbf712d0 (strong handle)
              -> 02a4bdc07ff0     System.Object[] 
              -> 02a4c0440708     System.String 
    

    The strong handle is the intern table. This means that the reference above was added to the static array that backs the intern table. Interning doesn't change the reference; it just adds it to the table, so the string is now also rooted by the intern table.

    This matches the True result of the first reference comparison (result1).

    Next, we generate a new string. Since this is just a regular string, it is allocated on the heap and gets a different address.

    PARAMETERS:
        args (0x000000B35897E880) = 0x000002a4bfc08308
    LOCALS:
        0x000000B35897E858 = 0x000002a4c0440708
        0x000000B35897E850 = 0x000002a4c0440708
        0x000000B35897E84C = 0x0000000000000001
        0x000000B35897E840 = 0x000002a4c0440790 <--- this is s2
    

    I have annotated the dump above to show s2 (the third address holds the bool from the first comparison, i.e. result1).

    Next, we try to intern this string. Intern looks up the value in the intern table and finds the reference that was added earlier (the first string), so the second call to Intern returns that reference, as can be seen from the locals.

    PARAMETERS:
        args (0x000000B35897E880) = 0x000002a4bfc08308
    LOCALS:
        0x000000B35897E858 = 0x000002a4c0440708
        0x000000B35897E850 = 0x000002a4c0440708
        0x000000B35897E84C = 0x0000000000000001
        0x000000B35897E840 = 0x000002a4c0440790
        0x000000B35897E838 = 0x000002a4c0440708
    

    This means that the addresses of s2 and i2 are different, which matches the False output for result2. Since i2 holds the same address as i1, the final comparison (result3) is True.

    I cannot say why you see a different result on the online compiler. Perhaps they are using a different runtime, but the output matches what I would expect.
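
    One thing that could be checked in the other environment, although I have not verified that this is what happens there, is whether an equal string is already in the intern table before the first call to Intern. In that case Intern returns the existing reference rather than s1, and the first comparison is False. The sketch below simulates that situation; the class name PreInternedDemo, the helper Fresh, and the random suffix are assumptions made for illustration only.

```csharp
using System;

public static class PreInternedDemo
{
    // Hypothetical helper: builds a new heap string with the same characters.
    private static string Fresh(string value) => new string(value.ToCharArray());

    public static (bool AlreadyPooled, bool Result1, bool Result2, bool Result3) Run()
    {
        // Assumption: the random suffix keeps the value out of the pool
        // until this method interns it explicitly.
        string value = "Hello-" + Guid.NewGuid().ToString("N");

        // Simulates code that ran earlier and interned an equal string.
        string.Intern(Fresh(value));

        string s1 = Fresh(value);
        // Diagnostic check: is an equal string in the pool before we intern s1?
        bool alreadyPooled = string.IsInterned(s1) != null;   // true

        string i1 = string.Intern(s1);                        // returns the earlier string, not s1
        bool result1 = ReferenceEquals(s1, i1);               // false

        string s2 = Fresh(value);
        string i2 = string.Intern(s2);
        bool result2 = ReferenceEquals(s2, i2);               // false
        bool result3 = ReferenceEquals(i1, i2);               // true

        return (alreadyPooled, result1, result2, result3);
    }

    public static void Main()
    {
        Console.WriteLine(Run());
    }
}
```

    Adding a string.IsInterned check like the one above before the first Intern call in the original test would show whether this applies to the environment that prints False, False, True.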