Search code examples
pythoncluster-analysislevenshtein-distancehierarchical

Implementing hierarchical clustering in Python using Levenshtein distance


Following up on my previous question, I have implemented a clustering algorithm for a huge number of strings using Python & Levenshtein distance..But it is taking a very long time to complete clustering. Any suggestions please?

<> iterate thro the list in a for loop for each item in list run through the list again, to find similarity percentage if similarity > threshold, move to cluster end for loop


Solution

  • First, use a profiler to see where most of the time is spent. I suspect it's in the actual Levenshtein calculation, but it's good to be sure. Iff it is:

    1. Implement the Levenshtein function with Cython. This will give you a massive speed-up already.
    2. Calculate the pairs in multiple threads. E.g. If you have 1000 strings, you have 1000000 pairs, so you can have each of 8 threads do 125000 of the pairs.