Search code examples
algorithmdata-structuresradix-tree

Words with at least 2 common letters


A string is named 2-consistent if each word has at least 2 letters in common with the next word.


For example

"Atom another era" [atom has a and t in common with another and another has e and a in common with era (the answer is not unique).

First of all I need a data structure which takes 2 words and answers in constant time at the question "Do these words have at least 2 letters in common?"

Now, given a string of n words I need to find the longest 2-consistent substring.

I can't figure out what data structure to use. I thought to radix tree or prefix tree, but I could not find the answer. Can you help me?


Solution

  • Assuming unaccented letters and ignoring capitalization, for each word you can store a bit-field in a 32-bit integer where bits 0-25 are set to 1 if the corresponding letter from a-z is present.

    The integer can be computed in linear time like this:

    int getBitField(char* word)
    {
        int bits = 0;
        while(*word)
            bits |= 1 << ((*word++) - 'a');
        return bits;
    }
    

    If the words are assumed to be words in English or some other language, with a maximum word length then the difference between constant and linear time is fairly meaningless because (for the sake of argument) all words less than the maximum length can be padded out with non-matching characters, which will result in a constant time algorithm.

    Once you have the bit fields for two words you can test if they are 2-consistent in constant time by ANDing them together and checking if the result is not zero (which would indicate no letters in common) and not a power of 2 (which would indicate only one letter in common as only a single bit is set). You can test for a power of 2 by ANDing a number with itself minus 1.

    bool is2Consistent(int word1bits, int word2bits)
    {
        int common = word1bits & word2bits;
        return (common & (common - 1)) != 0;
    }
    

    This won't work if you intend to define words like 'meet' and 'beef' which have repeated letters as 2-consistent.

    If you wanted to test for 3-consistency, you just need to add an extra line to the function:

    bool is3Consistent(int word1bits, int word2bits)
    {
        int common = word1bits & word2bits;
        common &= (common - 1);
        return (common & (common - 1)) != 0;
    }
    

    ANDing an integer with itself minus one just removes the least significant bit, so you could apply it an arbitrary number of times to test for 4-consistency, 5-consistency etc.