Search code examples
c++algorithmprocessing-efficiencycpu-speed

calculating how much two strings are similar?


I have a function that calculates the variance of the two given strings. Is there a faster method (or algorithm) to do such thing?

Please keep in mind that every letter of my strings are all loaded with DNA, which means that these are one of the A or T or C or G:

unsigned __int8 dis(char* FirstString, char* SecondString)
{
    unsigned __int8 distanceIndex = 0;
    for (unsigned __int8 i = 0; i < l; i++)
    {
        if (FirstString[i] != SecondString[i])
            distanceIndex++;
    }
    return distanceIndex;
}

Solution

  • Although I still doubt that the string comparison is really the bottleneck of your project, I couldn't resist to take up the challenge...

    All of your sequences are 13 chars long. DNA sequences contain only the letters ATCG, which can be encoded within 2 bits. You can store each DNA sequence within a 32 bit value, letting the computer do the comparison in parallel:

    • XOR-combine the values to get bit differences
    • shift and OR-combine AND-normalized subsets (odd bits, even bits) to transform bit differences into nucleobase differences
    • count the set bits to obtain the DNA sequence distance

    Depending on the computer architecture there may be a bit count function implemented in the CPU. More details have the answers to the question: How to count the number of set bits in a 32-bit integer?

    Here is the core function:

    int distV(const unsigned va, const unsigned vb)
    {
        const unsigned x = va ^ vb;
        const unsigned bn = ((x & 0xaaaaaaaa) >> 1 ) | (x & 0x55555555);
        return __builtin_popcount(bn);
    }
    

    See the full GCC-4.3.2 demo that uses sequences of length 16. I measured a performance increment of factor 4 for the comparison itself (excluding the encoding).