Tags: python, performance, text, fuzzy-search, bk-tree

How to implement a fast fuzzy-search engine using BK-trees when the corpus has 10 billion unique DNA sequences?


I am trying to use the BK-tree data structure in python to store a corpus with ~10 billion entries (1e10) in order to implement a fast fuzzy search engine.

Once I add over ~10 million (1e7) values to a single BK-tree, I start to see a significant degradation in the performance of querying.

I was thinking of storing the corpus in a forest of a thousand BK-trees and querying them in parallel.

Does this idea sound feasible? Should I create and query 1,000 BK-trees simultaneously? What else can I do in order to use BK-trees for this corpus?

I use pybktree.py and my queries are intended to find all entries within an edit distance d.

Is there some architecture or database which will allow me to store those trees?

Note: I don’t run out of memory, rather the tree begins to be inefficient (presumably each node has too many children).


Solution

  • FuzzyWuzzy

    Since you mention that you use FuzzyWuzzy as your distance metric, I will concentrate on efficient ways to implement the fuzz.ratio algorithm used by FuzzyWuzzy. FuzzyWuzzy provides the following two implementations of fuzz.ratio:

    1. difflib, which is completely implemented in Python
    2. python-Levenshtein, which uses a weighted Levenshtein distance with weight 2 for substitutions (a substitution counts as a deletion + an insertion). python-Levenshtein is implemented in C and is a lot faster than the pure Python implementation.

    Implementation in python-Levenshtein

    python-Levenshtein uses the following approach:

    1. It removes the common prefix and suffix of the two strings, since they have no influence on the end result. This can be done in linear time, so matching similar strings is very fast.
    2. The Levenshtein distance between the trimmed strings is then calculated with quadratic runtime and linear memory usage (see the sketch below).
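
    A minimal Python sketch of this approach (illustrative only, not python-Levenshtein's actual C code; the weight of 2 for substitutions matches the distance fuzz.ratio is based on):

    def weighted_levenshtein(s1, s2):
        # 1. strip the common prefix in linear time
        start = 0
        while start < len(s1) and start < len(s2) and s1[start] == s2[start]:
            start += 1
        # 2. strip the common suffix in linear time
        end1, end2 = len(s1), len(s2)
        while end1 > start and end2 > start and s1[end1 - 1] == s2[end2 - 1]:
            end1 -= 1
            end2 -= 1
        a, b = s1[start:end1], s2[start:end2]
        # 3. Wagner-Fischer on the trimmed strings, keeping only two rows (linear memory)
        prev = list(range(len(b) + 1))
        for i in range(1, len(a) + 1):
            cur = [i] + [0] * len(b)
            for j in range(1, len(b) + 1):
                sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else 2)  # substitution costs 2
                cur[j] = min(prev[j] + 1,     # deletion
                             cur[j - 1] + 1,  # insertion
                             sub)
            prev = cur
        return prev[-1]

    def ratio(s1, s2):
        # fuzz.ratio normalizes the weighted distance to a score between 0 and 100
        total = len(s1) + len(s2)
        return 100.0 if total == 0 else 100.0 * (total - weighted_levenshtein(s1, s2)) / total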

    RapidFuzz

    I am the author of the library RapidFuzz which implements the algorithms used by FuzzyWuzzy in a more performant way. RapidFuzz uses the following interface for fuzz.ratio:

    def ratio(s1, s2, processor=None, score_cutoff=0)
    

    The additional score_cutoff parameter can be used to provide a score threshold as a float between 0 and 100. When the ratio falls below score_cutoff, 0 is returned instead. The implementation can exploit this threshold to choose a more optimized algorithm in some cases. Below I describe the optimizations used by RapidFuzz depending on the input parameters; max distance refers to the maximum edit distance that is possible without the ratio dropping below the score threshold.
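
    For example (DNA-like strings chosen arbitrarily; the exact scores are approximate):

    from rapidfuzz import fuzz

    # the score is returned as usual when it reaches the cutoff ...
    fuzz.ratio("CAGTGTCACGAT", "CAGTGTGACGAT", score_cutoff=80)  # ~91.7
    # ... and 0 is returned when it falls below the cutoff, which allows
    # RapidFuzz to pick a cheaper algorithm internally
    fuzz.ratio("CAGTGTCACGAT", "TTTTTTTTTTTT", score_cutoff=80)  # 0.0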

    max distance == 0

    The similarity can be calculated using a direct comparison, since no difference between the strings is allowed. The time complexity of this algorithm is O(N).

    max distance == 1 and len(s1) == len(s2)

    The similarity can be calculated using a direct comparison as well, since a single substitution (weight 2) would already cause an edit distance higher than max distance. The time complexity of this algorithm is O(N).
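
    Both shortcut cases boil down to an equality check. A sketch of the idea (hypothetical helper, not RapidFuzz internals):

    def within_max_distance(s1, s2, max_dist):
        if max_dist == 0:
            # no edit is allowed, so only an exact match qualifies
            return s1 == s2
        if max_dist == 1 and len(s1) == len(s2):
            # with weight 2 for substitutions, equal-length strings can only
            # stay within distance 1 if they are identical
            return s1 == s2
        raise NotImplementedError("handled by the algorithms described below")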

    Remove common prefix

    A common prefix/suffix of the two compared strings does not affect the Levenshtein distance, so the affix is removed before calculating the similarity. This step is performed before any of the following algorithms.

    max distance <= 4

    The mbleven algorithm is used. This algorithm checks all edit operations that are possible under the threshold max distance. A description of the original algorithm can be found here. I changed this algorithm to support a weight of 2 for substitutions. Unlike with the normal Levenshtein distance, the algorithm can be used here up to a threshold of 4, since the higher weight of substitutions decreases the number of possible edit operations. The time complexity of this algorithm is O(N).
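
    A sketch of the underlying idea for the simplest case of a single allowed edit with uniform weights (not mbleven itself, and not the weight-2 variant RapidFuzz uses): instead of filling a DP table, each edit "model" that fits under the threshold is verified in a single linear pass.

    def within_one_edit(s1, s2):
        if s1 == s2:                          # model: no edit
            return True
        if len(s1) == len(s2):                # model: one substitution
            return sum(a != b for a, b in zip(s1, s2)) <= 1
        if abs(len(s1) - len(s2)) == 1:       # model: one insertion/deletion
            longer, shorter = (s1, s2) if len(s1) > len(s2) else (s2, s1)
            i = j = skipped = 0
            while i < len(longer) and j < len(shorter):
                if longer[i] == shorter[j]:
                    i += 1
                    j += 1
                elif not skipped:             # skip one character of the longer string once
                    skipped = 1
                    i += 1
                else:
                    return False
            return True
        return False                          # a length difference > 1 always needs > 1 edit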

    len(shorter string) <= 64 after removing common affix

    The BitPAl algorithm is used, which calculates the Levenshtein distance in a bit-parallel fashion. The algorithm is described here and is extended with support for UTF32 in this implementation. The time complexity of this algorithm is O(N).
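
    To illustrate the bit-parallel principle, here is a sketch using the better-known Myers bit-vector algorithm for the plain (unweighted) Levenshtein distance; BitPAl uses a different bit encoding, and RapidFuzz's version additionally handles the weight of 2 for substitutions:

    def bitparallel_levenshtein(s1, s2):
        # s1 must fit into one 64-bit word (len(s1) <= 64)
        m = len(s1)
        if m == 0:
            return len(s2)
        mask = (1 << m) - 1
        # bitmask per character: which positions of s1 contain that character
        peq = {}
        for i, c in enumerate(s1):
            peq[c] = peq.get(c, 0) | (1 << i)
        pv, mv, score, last = mask, 0, m, 1 << (m - 1)
        for c in s2:
            eq = peq.get(c, 0)
            xv = eq | mv
            xh = (((eq & pv) + pv) ^ pv) | eq
            ph = mv | (~(xh | pv) & mask)
            mh = pv & xh
            if ph & last:
                score += 1
            if mh & last:
                score -= 1
            ph = ((ph << 1) | 1) & mask
            mh = (mh << 1) & mask
            pv = mh | (~(xv | ph) & mask)
            mv = ph & xv
        return score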

    Strings with a length > 64

    The Levenshtein distance is calculated using Wagner-Fischer with Ukkonen's optimization. The time complexity of this algorithm is O(N * M). This could be replaced with a blockwise implementation of BitPAl in the future.
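
    A sketch of the band idea behind Ukkonen's optimization (uniform weights, hypothetical helper, not RapidFuzz's implementation): when a maximum distance is known, only the DP cells within max_dist of the diagonal can still lead to a result below the threshold.

    def levenshtein_banded(s1, s2, max_dist):
        # returns the distance if it is <= max_dist, otherwise max_dist + 1
        if abs(len(s1) - len(s2)) > max_dist:
            return max_dist + 1
        n, m, big = len(s1), len(s2), max_dist + 1
        prev = list(range(m + 1))
        for i in range(1, n + 1):
            lo, hi = max(1, i - max_dist), min(m, i + max_dist)
            cur = [big] * (m + 1)
            if lo == 1:
                cur[0] = i
            for j in range(lo, hi + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost, big)
            prev = cur
            if min(prev) > max_dist:      # no cell in the band can recover anymore
                return max_dist + 1
        return prev[m] if prev[m] <= max_dist else max_dist + 1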

    Improvements to processors

    FuzzyWuzzy provides multiple processors like process.extractOne that are used to calculate the similarity between a query and multiple choices. Implementing this in C++ as well allows two more important optimizations:

    1. When a scorer is used that is implemented in C++ as well, we can directly call the C++ implementation of the scorer and do not have to go back and forth between Python and C++, which provides a massive speedup.

    2. We can preprocess the query depending on the scorer that is used. As an example, when fuzz.ratio is used as the scorer, the query only has to be stored into the 64-bit blocks used by BitPAl once, which saves around 50% of the runtime when calculating the Levenshtein distance.

    So far only extractOne and extract_iter are implemented in C++, while extract, which you would use, is still implemented in Python and uses extract_iter. So it can already use the second optimization, but it still has to switch a lot between Python and C++, which is not optimal (a C++ implementation will probably be added in v1.0.0 as well).
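
    A usage sketch of the processors from Python (made-up example data, assuming a recent RapidFuzz version):

    from rapidfuzz import fuzz, process

    choices = ["CAGTGTCACGAT", "CAGTGTGACGAT", "TTTTTTTTTTTT"]

    # best match only; internally the score_cutoff rises with every better match found
    best = process.extractOne("CAGTGTAACGAT", choices, scorer=fuzz.ratio)

    # all matches with a score of at least 80
    matches = process.extract("CAGTGTAACGAT", choices, scorer=fuzz.ratio,
                              score_cutoff=80, limit=None)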

    Benchmarks

    I performed benchmarks for extractOne and the individual scorers during development that show the performance difference between RapidFuzz and FuzzyWuzzy. Keep in mind that the performance for your case (all strings of length 20) is probably not as good, since many of the strings in the dataset used are very short.


    The hardware the benchmarks were run on:

    • CPU: a single core of an i7-8550U
    • RAM: 8 GB
    • OS: Fedora 32

    Benchmark Scorers

    The code for this benchmark can be found here.

    Benchmark extractOne

    For this benchmark the code of process.extractOne is slightly changed to remove the score_cutoff parameter. This is done because in extractOne the score_cutoff is increased whenever a better match is found (and it exits once it finds a perfect match). In the future it would make more sense to benchmark process.extract, which does not have this behavior (the benchmark is performed using process.extractOne, since process.extract is not fully implemented in C++ yet). The benchmark code can be found here.

    This shows that, when possible, the scorers should not be used directly but through the processors, which can perform a lot more optimizations.

    Alternative

    As an alternative you could use a C++ implementation. The library RapidFuzz is available for C++ here. The implementation in C++ is relatively simple as well:

    #include <string>
    #include <vector>
    #include <rapidfuzz/fuzz.hpp>
    
    // function to load words into vector (definition not shown)
    std::vector<std::string> choices = load("words.txt");
    std::string query = choices[0];
    std::vector<double> results;
    results.reserve(choices.size());
    
    // the query is preprocessed once and reused for every comparison
    rapidfuzz::fuzz::CachedRatio<decltype(query)> scorer(query);
    for (const auto& choice : choices)
    {
      results.push_back(scorer.ratio(choice));
    }
    

    or in parallel using OpenMP:

    #include <cstddef>
    #include <string>
    #include <vector>
    #include <rapidfuzz/fuzz.hpp>
    
    // function to load words into vector (definition not shown)
    std::vector<std::string> choices = load("words.txt");
    std::string query = choices[0];
    // pre-size the result vector so every thread writes to its own slot
    // (push_back from multiple threads would be a data race)
    std::vector<double> results(choices.size());
    
    rapidfuzz::fuzz::CachedRatio<decltype(query)> scorer(query);
    
    #pragma omp parallel for
    for (std::size_t i = 0; i < choices.size(); ++i)
    {
      results[i] = scorer.ratio(choices[i]);
    }
    

    On my machine (see the benchmarks above) this evaluates 43 million words/sec, and 123 million words/sec in the parallel version. This is around 1.5 times as fast as the Python implementation (due to conversions between Python and C++ types). However, the main advantage of the C++ version is that you are relatively free to combine algorithms whichever way you want, while in the Python version you're forced to use the process functions that are implemented in C++ to achieve good performance.