Search code examples
stringalgorithmlanguage-agnosticuniqueidentifier

How to uniquely identify a set of strings using an integer


Here my problem statement:

  • I have a set of strings that match a regular expression. let's say it matches [A-Z][0-9]{3} (i.e. 1 letter and 3 digits).
  • I can have any number of strings between 1 and 30. For example I could have:
    • {A123}
    • {A123, B456}
    • {Z789, D752, E147, ..., Q665}
    • ...
  • I need to generate an integer (actually I can use 256 bits) that would be unique for any set of strings regardless of the number of elements (although the number of elements could be used to generate the integer)

What sort of algorithm could I use?

My first idea would be to convert my strings to number and then do operations (I thought of hash functions) on them but I am not sure what formula would be give me could results.

Any suggestion?


Solution

  • You have 2^333 possible input sets ((26 * 10^3) choose 30).

    This means you would need a 333 bit wide integer to represent all possibilities. You only have a maximum of 256 bits, so there will be collisions.

    This is a typical application for a hash function. There are hashes for various purposes, so it's important to select the right type:

    • A simple hash function for use in bucket based data structures (dictionaries) must be fast. Collisions are not only tolerated but wanted. The hash's size (in bits) usually is small. Due to collisions this type of hash is not suited for your purpose.

    • A checksum tries to avoid collisions and is reasonably fast. If it's large enough this might be enough for your case.

    • Cryptographic hashes have the characteristic that it's not possible (or very hard) to find a collision (even when both input and hash are known). Also they are not invertible (from the hash it's not possible to find the input). These are usually computationally expensive and overkill for your use case.

    • Hashes to uniquely identify arbitrary inputs, like CityHash and SpookyHash are designed for fast hashing and collision free identification.

    SpookyHash seems like a good candidate for your use case. It's 128 bits wide, which means that you need 2^64 differing inputs to get a 50% chance of a single collision.

    It's also fast: three bytes per cycle is orders of magnitude faster than md5 or sha1. SpookyHash is available in the public domain (see link above).

    To apply any hash on your use case you could convert the items in your list to numbers, but it seems easier to just feed them as strings. You have to settle for an encoding in this case (ASCII would do).

    I'm usually using UTF8 or so, when I18N is an issue. Then it's sometimes important to care for canonicalization. But this does not apply to your simple use case.