
recommender systems: Convert UUIDs to 32-bit ints for recommender libraries


LightFM and other libraries ask for a 32-bit integer id, e.g. for users. But our user id is a UUID, e.g. 0003374a-a35c-46ed-96d2-0ea32b753199. I was wondering what you would recommend in scenarios like these. What I have come up with is:

  • Create a bidirectional dictionary, either in memory or in a database, to keep a UUID <-> int mapping, e.g. with https://github.com/jab/bidict (see the first sketch after this list).
  • Use a non-cryptographic hash function like MurmurHash3 or xxHash. For example, hashing 10 million UUIDs with xxHash gave me around 11,521 collisions, or about 0.1% (see the second sketch after this list). Is that negligible for a recommender system?
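
A minimal sketch of the first option, using the bidict package; the dense id-assignment scheme here is my own assumption, chosen because LightFM-style libraries expect ids in a compact 0..n range:

    # pip install bidict
    from bidict import bidict

    # UUID <-> int mapping; integer ids are assigned densely from 0,
    # which is the id range LightFM-style libraries expect.
    uuid_to_int = bidict()

    def get_or_assign_id(user_uuid: str) -> int:
        """Return the integer id for a UUID, assigning the next free id if unseen."""
        if user_uuid not in uuid_to_int:
            uuid_to_int[user_uuid] = len(uuid_to_int)
        return uuid_to_int[user_uuid]

    int_id = get_or_assign_id("0003374a-a35c-46ed-96d2-0ea32b753199")  # -> 0
    original_uuid = uuid_to_int.inverse[int_id]  # back to the UUID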
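
And a quick way to reproduce the collision measurement from the second option; the use of random version-4 UUIDs and xxh32 here is an assumption about how the count was taken:

    # pip install xxhash
    import uuid
    import xxhash

    n = 10_000_000  # takes a few minutes and a few hundred MB of RAM
    seen = set()
    collisions = 0
    for _ in range(n):
        h = xxhash.xxh32(uuid.uuid4().bytes).intdigest()  # unsigned 32-bit int
        if h in seen:
            collisions += 1
        else:
            seen.add(h)

    print(f"{collisions} collisions ({collisions / n:.2%})")

The birthday approximation n**2 / 2**33 predicts roughly 11,600 collisions for 10 million keys in a 32-bit range, which matches the observed figure.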

I'm also curious how this would apply in an online prediction scenario, where, given the UUID, the user's interactions, and the model, I have to produce recommendations from a model that needs 32-bit integers. The in-memory bidict approach won't work there, so in the worst case I may have to create a persistent key-value store.
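
For the online case, one way to make the mapping outlive a single process is a small SQLite-backed store; this is only an illustration of the persistent key-value idea, not a specific recommendation:

    import sqlite3

    class PersistentIdMap:
        """UUID <-> int mapping backed by SQLite, so prediction-time
        workers can look up the ids that were assigned at training time."""

        def __init__(self, path: str = "id_map.db"):
            self.conn = sqlite3.connect(path)
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS ids "
                "(uuid TEXT PRIMARY KEY, int_id INTEGER UNIQUE)"
            )

        def get_or_assign(self, user_uuid: str) -> int:
            row = self.conn.execute(
                "SELECT int_id FROM ids WHERE uuid = ?", (user_uuid,)
            ).fetchone()
            if row is not None:
                return row[0]
            # Assign the next dense id; adequate for a single writer.
            (count,) = self.conn.execute("SELECT COUNT(*) FROM ids").fetchone()
            self.conn.execute(
                "INSERT INTO ids (uuid, int_id) VALUES (?, ?)", (user_uuid, count)
            )
            self.conn.commit()
            return count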


Solution

    1. This will definitely work, and is probably the solution the vast majority of users will choose. The disadvantage lies, of course, in having to maintain the mapping.
    2. A hashing function will also work. There are, in fact, approaches which use hashing to reduce the dimensionality of the embedding layers required. One thing worth bearing in mind is that the resulting hash range should be relatively compact: most implementations will allocate parameters for all possible values, so a hashing function that can hash to very large values will require exorbitant amounts of memory. Hashing followed by a modulo function (sketched below) could work well; the trade-off is then between the memory required to hold all the parameters and the collision probability.
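
    A sketch of the hash-then-modulo idea; the bucket count of 2**20 is an arbitrary assumption, to be tuned against the acceptable collision rate:

        import xxhash

        NUM_BUCKETS = 2**20  # ~1M embedding rows instead of 2**32

        def uuid_to_bucket(user_uuid: str) -> int:
            """Hash a UUID into a compact id range so the model only has
            to allocate NUM_BUCKETS parameter rows."""
            return xxhash.xxh32(user_uuid.encode()).intdigest() % NUM_BUCKETS

    For n users hashed into B buckets (with n much smaller than B), the expected number of collisions is roughly n**2 / (2 * B), so halving the bucket count roughly doubles the collisions.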

    In LightFM, as in most other implementations, recommendations can only be made for users and items (or at least for user and item features) that were present during training. The mapping then becomes part of the model itself, and is effectively frozen until a new model is trained.
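
    Because the mapping is frozen with the model, it is natural to serialize the two together; a minimal sketch using pickle (the file name and artifact structure are my own assumptions):

        import pickle
        from lightfm import LightFM

        model = LightFM()  # stand-in for a trained model
        uuid_to_int = {"0003374a-a35c-46ed-96d2-0ea32b753199": 0}

        # After training: store model and mapping in one artifact.
        with open("model_and_mapping.pkl", "wb") as f:
            pickle.dump({"model": model, "uuid_to_int": uuid_to_int}, f)

        # At prediction time: load both, translate UUID -> int, then predict.
        with open("model_and_mapping.pkl", "rb") as f:
            artifacts = pickle.load(f)
        model, uuid_to_int = artifacts["model"], artifacts["uuid_to_int"]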