Search code examples
javacastingunique

how can i generate a unique int from a unique string?


I have an object with a String that holds a unique id . (such as "ocx7gf" or "67hfs8") I need to supply it an implementation of int hascode() which will be unique obviously.

how do i cast a string to a unique int in the easiest/fastest way?

10x.

Edit - OK. I already know that String.hashcode is possible. But it is not recommended in any place. Actually' if any other method is not recommended - Should I use it or not if I have my object in a collection and I need the hashcode. should I concat it to another string to make it more successful?


Solution

  • No, you don't need to have an implementation that returns a unique value, "obviously", as obviously the majority of implementations would be broken.

    What you want to do, is to have a good spread across bits, especially for common values (if any values are more common than others). Barring special knowledge of your format, then just using the hashcode of the string itself would be best.

    With special knowledge of the limits of your id format, it may be possible to customise and result in better performance, though false assumptions are more likely to make things worse than better.

    Edit: On good spread of bits.

    As stated here and in other answers, being completely unique is impossible and hash collisions are possible. Hash-using methods know this and can deal with it, but it does impact upon performance, so we want collisions to be rare.

    Further, hashes are generally re-hashed so our 32-bit number may end up being reduced to e.g. one in the range 0 to 22, and we want as good a distribution within that as possible to.

    We also want to balance this with not taking so long to compute our hash, that it becomes a bottleneck in itself. An imperfect balancing act.

    A classic example of a bad hash method is one for a co-ordinate pair of X, Y ints that does:

    return X ^ Y;
    

    While this does a perfectly good job of returning 2^32 possible values out of the 4^32 possible inputs, in real world use it's quite common to have sets of coordinates where X and Y are equal ({0, 0}, {1, 1}, {2, 2} and so on) which all hash to zero, or matching pairs ({2,3} and {3, 2}) which will hash to the same number. We are likely better served by:

    return ((X << 16) | (x >> 16)) ^ Y;
    

    Now, there are just as many possible values for which this is dreadful than for the former, but it tends to serve better in real-world cases.

    Of course, there is a different job if you are writing a general-purpose class (no idea what possible inputs there are) or have a better idea of the purpose at hand. For example, if I was using Date objects but knew that they would all be dates only (time part always midnight) and only within a few years of each other, then I might prefer a custom hash code that used only the day, month and lower-digits of the years, over the standard one. The writer of Date though can't work on such knowledge and has to try to cater for everyone.

    Hence, If I for instance knew that a given string is always going to consist of 6 case-insensitive characters in the range [a-z] or [0-9] (which yours seem to, but it isn't clear from your question that it does) then I might use an algorithm that assigned a value from 0 to 35 (the 36 possible values for each character) to each character, and then walk through the string, each time multiplying the current value by 36 and adding the value of the next char.

    Assuming a good spread in the ids, this would be the way to go, especially if I made the order such that the lower-significant digits in my hash matched the most frequently changing char in the id (if such a call could be made), hence surviving re-hashing to a smaller range well.

    However, lacking such knowledge of the format for sure, I can't make that call with certainty, and I could well be making things worse (slower algorithm for little or even negative gain in hash quality).

    One advantage you have is that since it's an ID in itself, then presumably no other non-equal object has the same ID, and hence no other properties need be examined. This doesn't always hold.