Search code examples
databasealgorithmdata-structuresrdbms

Find closest to completion strings in set given a subset


Let's say that I have a set of words, where chars are in the set of a-z. Let's also assume that the words are up to 10 chars long and the set can be built by all the combinations (no permutations, so I don't care about the order) of such chars with no duplicates. The set which is in a DB, is empty at the beginning and someone has to populate it. I have complete freedom on how to structure the DB in order to make it efficient for this query. Let's start with an example.

Someone is populating the db by inserting these words:

"ab"
"ac"
"abcde"
"def"
"xyz"

Now my subset is this one:

"cabd"

What my query/algorithm should do, is that it should return me the list of words ordered by "completion". To be more clear the above query should return these words in order:

"ab"
"ac"
"abcde"
"def"
"xyz"

Let's explain:

  • "ab" and "ac" should be at the top (I don't care which one comes first in this case) because my subset contains all the chars that are in this words.
  • "abcde" is the third because I own all but "e" of the word, so I'm missing just 1 char
  • "def" comes later because I'm missing 2 chars of the word ("e" and "f")
  • "xyz" comes at the end as I don't own any char in it

Further observations: As you can see, I don't care about the order. If my subset query would have been "abcd" the results should be exactly the same.

Now things get complicated: Every word is stored in the DB with an ID as a primary key. The ideal solution should be that the algorithm should print 10 (or a limited number) of IDs that I will use to query the words by my own. FYI, I'm using Firebase, so I can not rely on SQL at the moment

A bruteforce solution would be to store in a different table a char-word relation. So to store all the word ids that contain a specific char:

a : {
    "id1",
    "id2",
    "id3",
    "id4",
    ....
}
b : {
    "id1",
    "id4",
    ....
}

Where ids are:

id1 : {
    "ab"
}
id2 : {
    "ac"
}
id3 : {
    "ad"
}
id4 : {
    "abc"
}

As you can see, with this approach the algorithm would give me thousands of results that I will need to query and order, so it's not scalable. Is there any other solution or smart approach to solve this problem?


Solution

  • The best solution may depend on the SQL engine you are using, as some will have functions to solve some of the steps needed.

    Here is one idea:

    In the table with words you could add an integer column that will represent the letters that occur in the word. An integer has enough bits available for storing one bit of information per letter in the alphabet: a 1 would mean the corresponding letter occurs, a 0 that it doesn't. So 26 bits are needed for representing characters in the range a-z.

    You could then create a trigger on that table so that this integer is calculated and stored whenever you insert a new word in that table.

    Then for a given input word X, you would also calculate that integer. To get the right order, you would then perform the bitwise OR of this integer with each of the integers in your table, and count the 1-bits in the result. The fewer 1-bits, the better the match. The fewest number of 1-bits will correspond to the number of bits in the integer representation of X. Every bit that is counted on top of that, indicates a character in the table row that does not occur in X.

    Here is a script for setting this up in MySql:

    --/
    create function bitset(str varchar(10)) returns int
    begin
    declare num int;
        set num = 0;
        while length(str) > 0 do
            set num = num | power(2, ord(str) - ord('a'));
            set str = substr(str, 2);
        end while;
        return num;
    end
    /
    
    create table words (
        word varchar(10),
        bits int
    );
    
    create trigger ins_word before insert on words for each row
        set new.bits = bitset(new.word);
    
    insert into words(word) values ('ab'), ('ac'), ('abcde'), ('def'), ('xyz');
    
    select   word, 
             bits,
             bit_count(bitset('cabd') | bits) bitwise_or
    from     words
    order by 3;
    

    The final query makes use of the custom function bitset and a function bit_count that is native in MySql.

    The output of the final query would look like this:

     word |    bits  | bitwise_or
    ------+----------+-----------
    ab    |        3 |    4
    ac    |        5 |    4
    abcde |       31 |    5
    def   |       56 |    6
    xyz   | 58720256 |    7