Search code examples
databasealgorithmindexingbinaryhamming-distance

Storing and indexing binary strings in a database


A binary string as defined here is fixed size "array" of bits. I call them strings since there is no order on them (sorting/indexing them as numbers has no meaning), each bit is independent of the others. Each such string is N bits long, with N in the hundreds.

I need to store these strings and given a new binary string query for the nearest neighbor using the Hamming distance as the distance metric.
There are specialized data-structures (metric-trees) for metric-based search (VP-trees, cover-trees, M-trees), but I need to use a regular database (MongoDB in my case).

Is there some indexing function that can be applied to the binary strings that can help the DB access only a subset of the records before performing the one-to-one Hamming distance match? Alternatively, how would it be possible to implement such Hamming based search on a standard DB?


Solution

  • The hamming distance is a metric so it satisfies the triangle inequality. For each bitstring in your database, you could store the it's hamming distance to some pre-defined constant bitstring. Then you can use the triangle inequality to filter out bitstrings in the database.

    So let's say

    C <- some constant bitstring
    S <- bitstring you're trying to find the best match for
    B <- a bitstring in the database
    distS <- hamming_dist(S,C)
    distB <- hamming_dist(B,C)
    

    So for each B, you would store it's corresponding distB.

    A lower bound for hamming(B,S) would then be abs(distB-distS). And the upper bound would be distB+distS.

    You can discard all B such that the lower bound is higher than the lowest upper bound.

    I'm not 100% sure as to the optimal way to choose which C. I think you would want it to be a bitstring that's close to the "center" of your metric space of bitstrings.