java hashcode geohashing

Why does Java's String.hashCode() produce many collisions on different but similar geohash strings?


A geohash string is a feature in my sparse logistic regression model, so I use Java's String.hashCode() to turn each geohash string into an int feature id. However, I found that hashCode() performs badly on similar geohash strings: different features end up with the same feature id, which may hurt model optimization even though the features are similar. For example, each of the following pairs of similar geohash strings has the same hashCode.

<"wws8vw", "wws8x9">
    "wws8vw".hashCode() = -774715770
    "wws8x9".hashCode() = -774715770
<"wmxy0", "wmxwn">
    "wmxy0".hashCode() = 113265337
    "wmxwn".hashCode() = 113265337

I guess there is some relationship between how geohash strings are generated and how Java's hashCode() works. Can anyone explain the real reason, and how to reduce collisions on geohash strings?


Solution

  • I think you are misunderstanding the purpose of the Object.hashCode() method: not hashing in general, but the reason why Java objects have this method. Its Javadoc says:

    This method is supported for the benefit of hash tables such as those provided by HashMap.

    So if you are trying to use this method as an input to a machine learning model, you're not using it for its intended purpose.

    The answer is reasonably obvious: you need to design your own hashing method, or select a pre-existing one, that gives you the desired collision profile for your expected inputs. You cannot change the algorithm used by String.hashCode().
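To make this concrete, here is a minimal sketch rather than a definitive implementation. String.hashCode() is documented as the polynomial s[0]*31^(n-1) + ... + s[n-1], so two strings with the same prefix collide whenever their last two characters give the same value of 31*c1 + c2, which is what happens with "vw" and "x9". The sketch swaps in 32-bit FNV-1a as one possible pre-existing hash. That choice is an assumption for illustration and is not something the question or the answer prescribes. The class name GeohashFeatureHash, the helper featureId, and the numBuckets parameter are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class GeohashFeatureHash {

    // Standard 32-bit FNV-1a constants: offset basis and prime.
    private static final int FNV_OFFSET_BASIS = 0x811C9DC5;
    private static final int FNV_PRIME = 0x01000193;

    /** Hashes the UTF-8 bytes of s with 32-bit FNV-1a. */
    public static int fnv1a32(String s) {
        int h = FNV_OFFSET_BASIS;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);  // mix in the next byte as an unsigned value
            h *= FNV_PRIME;   // int overflow wraps, i.e. arithmetic mod 2^32
        }
        return h;
    }

    /** Hypothetical helper: maps a geohash to a feature id in [0, numBuckets). */
    public static int featureId(String geohash, int numBuckets) {
        return Math.floorMod(fnv1a32(geohash), numBuckets);
    }

    public static void main(String[] args) {
        // The last two characters contribute 31 * c1 + c2 to String.hashCode().
        System.out.println(31 * 'v' + 'w'); // prints 3777
        System.out.println(31 * 'x' + '9'); // prints 3777

        // Same prefix and same contribution from the tail, so the codes collide.
        System.out.println("wws8vw".hashCode() == "wws8x9".hashCode()); // prints true

        // FNV-1a separates this pair.
        System.out.println(fnv1a32("wws8vw") == fnv1a32("wws8x9")); // prints false
    }
}
```

Any 32-bit hash will still collide on some inputs, and reducing it to a smaller range with featureId adds more collisions, so it is worth measuring the collision rate on your real set of geohash strings before settling on a function.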