Search code examples
javaalgorithmperformanceimplementationbk-tree

Has this algorithm been implemented properly?


I am currently implementing a BK-Tree to make a spell checker. The dictionary I am working with is very large (millions of words), which is why I cannot afford any inefficiencies at all. However, I know that the lookup function that I wrote (arguably the most important part of the entire program) can be made better. I was hoping to find some help regarding the same. Here's the lookup that I wrote:

public int get(String query, int maxDistance)
{
    calculateLevenshteinDistance cld = new calculateLevenshteinDistance();
    int d = cld.calculate(root, query);
    int tempDistance=0;

    if(d==0)
        return 0;

    if(maxDistance==Integer.MAX_VALUE)
        maxDistance=d;

    int i = Math.max(d-maxDistance, 1);
    BKTree temp=null;

    for(;i<=maxDistance+d;i++)
    {
        temp=children.get(i);
        if(temp!=null)
        {
            tempDistance=temp.get(query, maxDistance);
        }
        if(maxDistance<tempDistance)
            maxDistance=tempDistance;
    }

    return maxDistance;
}

I know that I am running the loop an unnecessarily large number of times and that we can trim the search space to make the lookup faster. I'm just not sure how to best do that.


Solution

  • Your loop looks generally correct, if a little byzantine. Your attempt to refine the stopping condition (with tempdistance/maxdistance) is incorrect, however: the structure of the BK-tree requires that you explore all nodes within levenshtein distance d-k to d+k of the current node if you want to find all the results, so you can't prune it like that.

    What makes you think you're exploring too much of the tree?

    You may find my followup post on Levenshtein Automata instructive, as they're more efficient than BK-trees. If you're building a spelling checker, though, I'd recommend following Favonius' suggestion and checking out this article on how to write one. It's much better suited to spelling correction than a naive string-distance check.