Search code examples
algorithmstring-matchingfingerprint

How to fuzzy match a short bit pattern in a long one?


I encounter a problem when try to match a short bit pattern in a long one: I have one long bit pattern, e.g. 6k bits, stored in a char array, also a short one, say 150 bits, stored in a char array, too. Now I want to check whether the short bit pattern is in the long bit pattern. While there is no need for short bit pattern to match some part of long bit pattern exactly, I will define a threshold, if the bit-error-rate under it, I will take the two pattern match.

Given the misalignment problem, I can't come up with an elegant solution. One way I can find out is convert the bit pattern into char pattern, i.e. convert bit 1 to '1', 0 to '0' and apply some string matching algorithm. But I'm afraid it may cost memory 7-8 times more burden my system. Someone around me recommend the Rabin Fingerprint, however it seems not designed for this kind of problem.

Hope you can give me a hand.

Thanks and best Regards.


Solution

  • The operation you're looking for is population count or the closely related hamming distance.

    Rather than implement lots of bitwise arithmetic by hand, try the Gnu Multi-Precision Library, which includes several bit-string functions.

    • Use mpz_tdiv_q_2exp to right-shift the long pattern one bit at a time,
    • mpz_tdiv_r_2exp to extract the last 150 bits, and
    • mpz_hamdist to find the number of bits flipped between the extracted bits and the short pattern.

    Should be plenty fast, and fast to write as well!

    As an initial optimization, I would suggest shifting the 150-bit pattern by one-bit increments up to 7 bits, so you have 8 patterns to match against, from 150 up to 157 bits. Then, rather than shift the long pattern one bit at a time (which is slow and likely dominates the runtime), shift 8 bits at a time. Be sure to clear the bits you do not wish to compare.