c# sql-server regex hash gethashcode

Storing C# GetHashCode() in DB is Unreliable


Possible Duplicate:
How do I create a HashCode in .net (c#) for a string that is safe to store in a database?

I'm planning to store hundreds of thousands of URLs in my database. Each row in my UrlInfo table will be immutable, with the URL itself being the logical primary key. Since URLs can be fairly lengthy, I've decided to hash the URL as a quick way to find possible matches when adding new rows. The hash isn't my true key, just a way to find possible matches quickly. In addition, I'm using a regex pattern per domain that distills the essence of the URL into something that can be compared against other URLs. I'm also storing the result of the regex as a hash, and I'm not concerned if it yields possible duplicates.

Everything was going great until I learned that C#'s string.GetHashCode() method, which I've been using to hash things, is not guaranteed to return the same value across implementations of .NET. I noticed this when I tried moving my hash function from ASP.NET to SQL Server CLR code. The web app uses .NET 4.0, while SQL Server 2008 R2, I've learned, uses .NET 3.5. They yielded different hash results for the same string, so now I need to get away from string.GetHashCode(): I don't want to have to worry about the hashes changing when I upgrade my app to future versions of .NET.

So, questions:

  1. Does my architecture smell since I'm storing a hash in my db? Any better ways? Obviously Microsoft doesn't want me to store hash results!

  2. Can someone recommend a good C# replacement algorithm for hashing strings? I saw Jon's here, but I'm not exactly sure how to modify it to work for strings (loop through each char using ASCII codes?).

  3. Are there any string compression algorithms that would be better than using a hash algorithm?

Thanks

AWESOME RESPONSES BY MANY. THANK YOU VERY MUCH!!!


Solution

  • You could always use an MD5 hash instead, which is relatively fast:

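    // Requires: using System.Security.Cryptography; using System.Text;
    public string GetUrlHash(string url) {

        // MD5 is IDisposable, so release it once the digest is computed.
        using (MD5 md5 = MD5.Create()) {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(url));

            // Format each of the 16 digest bytes as two uppercase hex digits.
            StringBuilder sb = new StringBuilder(hash.Length * 2);
            for (int i = 0; i < hash.Length; i++) {
                sb.Append(hash[i].ToString("X2"));
            }

            return sb.ToString();
        }
    }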
    

    Call it like so:

    Console.WriteLine(this.GetUrlHash("http://stackoverflow.com/questions/5355003/storing-c-gethashcode-in-db-is-unreliable"));
    

    And get:

    > 777BED7F83C66DAC111977067B4B4385
    

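    This should be fairly reliable from a uniqueness standpoint: MD5 produces a 128-bit digest, so accidental collisions between URLs are extremely unlikely. Unlike string.GetHashCode(), the result is defined by the algorithm itself, so it won't change between .NET versions. MD5 is no longer considered secure for uses such as password hashing, but you don't have that problem here.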

    The only problem is that using a 32-character string like this as a primary key on a table might be problematic, performance-wise.
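
    If the string key turns out to be too wide, one option is to keep only part of the digest and store it in a BIGINT lookup column. The sketch below is just an illustration of that idea, not something from the original answer: the class name UrlHashing and the method GetUrlHash64 are made up for this example. It keeps the first 8 bytes of the MD5 digest, so collisions become more likely than with the full 128 bits, and you would still compare the full URL on a hash match, as the question already plans to do.

    ```csharp
    using System;
    using System.Security.Cryptography;
    using System.Text;

    public static class UrlHashing
    {
        // Hypothetical helper: folds the first 8 bytes of the MD5 digest
        // into a signed 64-bit value that fits a BIGINT column.
        public static long GetUrlHash64(string url)
        {
            using (MD5 md5 = MD5.Create())
            {
                byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(url));

                // Assemble the bytes explicitly (big-endian) so the result
                // does not depend on the machine's byte order.
                ulong value = 0;
                for (int i = 0; i < 8; i++)
                {
                    value = (value << 8) | digest[i];
                }

                // Reinterpret the 64 bits as signed; no overflow check wanted.
                return unchecked((long)value);
            }
        }
    }
    ```

    Because the value depends only on the UTF-8 bytes of the URL and on MD5 itself, the web app and SQL Server CLR code should agree on it regardless of which .NET version each one runs.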

    The other thing you could do is use the URL shortener approach: use your database's sequence generation feature, and convert the value (make sure you use the equivalent of LONG or BIGINT!) to something like Base36, which gives you a nice, concise string.
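
    A minimal sketch of the Base36 step, assuming the sequence value has already been read from the database as a non-negative long. The class name Base36, the method Encode, and the lowercase digit alphabet are choices made for this example rather than anything prescribed above.

    ```csharp
    using System;
    using System.Text;

    public static class Base36
    {
        // Digit alphabet: 0-9 followed by a-z (an arbitrary but common choice).
        private const string Digits = "0123456789abcdefghijklmnopqrstuvwxyz";

        // Encodes a non-negative sequence value as a Base36 string.
        public static string Encode(long value)
        {
            if (value < 0)
            {
                throw new ArgumentOutOfRangeException(
                    "value", "Sequence values are assumed to be non-negative.");
            }
            if (value == 0)
            {
                return "0";
            }

            StringBuilder sb = new StringBuilder();
            while (value > 0)
            {
                // Prepend the least significant digit, then shift right by one digit.
                sb.Insert(0, Digits[(int)(value % 36)]);
                value /= 36;
            }
            return sb.ToString();
        }
    }
    ```

    For example, Base36.Encode(35) returns "z" and Base36.Encode(36) returns "10". Even the largest BIGINT value encodes to only 13 characters, which is where the conciseness comes from.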