Search code examples
c#.netstringperformancegethashcode

Does String.GetHashCode consider the full string or only part of it?


I'm just curious because I guess it will have impact on performance. Does it consider the full string? If yes, it will be slow on long string. If it only consider part of the string, it will have bad performance (e.g. if it only consider the beginning of the string, it will have bad performance if a HashSet contains mostly strings with the same.


Solution

  • Be sure to obtain the Reference Source source code when you have questions like this. There's a lot more to it than what you can see from a decompiler. Pick the one that matches your preferred .NET target, the method has changed a great deal between versions. I'll just reproduce the .NET 4.5 version of it here, retrieved from Source.NET 4.5\4.6.0.0\net\clr\src\BCL\System\String.cs\604718\String.cs

            public override int GetHashCode() { 
    
    #if FEATURE_RANDOMIZED_STRING_HASHING
                if(HashHelpers.s_UseRandomizedStringHashing)
                { 
                    return InternalMarvin32HashString(this, this.Length, 0);
                } 
    #endif // FEATURE_RANDOMIZED_STRING_HASHING 
    
                unsafe { 
                    fixed (char *src = this) {
                        Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");
                        Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");
    
    #if WIN32
                        int hash1 = (5381<<16) + 5381; 
    #else 
                        int hash1 = 5381;
    #endif 
                        int hash2 = hash1;
    
    #if WIN32
                        // 32 bit machines. 
                        int* pint = (int *)src;
                        int len = this.Length; 
                        while (len > 2) 
                        {
                            hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0]; 
                            hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
                            pint += 2;
                            len  -= 4;
                        } 
    
                        if (len > 0) 
                        { 
                            hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
                        } 
    #else
                        int     c;
                        char *s = src;
                        while ((c = s[0]) != 0) { 
                            hash1 = ((hash1 << 5) + hash1) ^ c;
                            c = s[1]; 
                            if (c == 0) 
                                break;
                            hash2 = ((hash2 << 5) + hash2) ^ c; 
                            s += 2;
                        }
    #endif
    #if DEBUG 
                        // We want to ensure we can change our hash function daily.
                        // This is perfectly fine as long as you don't persist the 
                        // value from GetHashCode to disk or count on String A 
                        // hashing before string B.  Those are bugs in your code.
                        hash1 ^= ThisAssembly.DailyBuildNumber; 
    #endif
                        return hash1 + (hash2 * 1566083941);
                    }
                } 
            }
    

    This is possibly more than you bargained for, I'll annotate the code a bit:

    • The #if conditional compilation directives adapt this code to different .NET targets. The FEATURE_XX identifiers are defined elsewhere and turn features off whole sale throughout the .NET source code. WIN32 is defined when the target is the 32-bit version of the framework, the 64-bit version of mscorlib.dll is built separately and stored in a different subdirectory of the GAC.
    • The s_UseRandomizedStringHashing variable enables a secure version of the hashing algorithm, designed to keep programmers out of trouble that do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption. It is enabled by an entry in the app.exe.config file
    • The fixed statement keeps indexing the string cheap, avoids the bounds checking done by the regular indexer
    • The first Assert ensures that the string is zero-terminated as it should be, required to allow the optimization in the loop
    • The second Assert ensures that the string is aligned to an address that's a multiple of 4 as it should be, required to keep the loop performant
    • The loop is unrolled by hand, consuming 4 characters per loop for the 32-bit version. The cast to int* is a trick to store 2 characters (2 x 16 bits) in a int (32-bits). The extra statements after the loop deal with a string whose length is not a multiple of 4. Note that the zero terminator may or may not be included in the hash, it won't be if the length is even. It looks at all the characters in the string, answering your question
    • The 64-bit version of the loop is done differently, hand-unrolled by 2. Note that it terminates early on an embedded zero, so doesn't look at all the characters. Otherwise very uncommon. That's pretty odd, I can only guess that this has something to do with strings potentially being very large. But can't think of a practical example
    • The debug code at the end ensures that no code in the framework ever takes a dependency on the hash code being reproducible between runs.
    • The hash algorithm is pretty standard. The value 1566083941 is a magic number, a prime that is common in a Mersenne twister.