Tags: algorithm, signal-processing, similarity, levenshtein-distance, information-theory

Is there a version of Levenshtein distance that works for series of floats?


I want to calculate the similarity between time series data segments that can be of different lengths. The similarity metric should take into account differences in length as well as differences in value. I thought Levenshtein distance would be great for this, if only it worked on series of floats instead of strings.

This question explains how to use Levenshtein distance with lists of ints when the differences in the values of the ints being replaced do not matter. In this case the differences in the values DO matter, and larger differences should be penalized more (and I'm working with floats).

Of course I am open to other similarity metrics that accomplish something similar, I just thought Levenshtein distance was already very close to what I wanted.

Example:

  1. (0.22, 0.8, 1.2, 3.89)
  2. (0.2, 0.61, 9.2)

Desired behavior: a small penalty for comparing the first elements (0.22 vs 0.2), a somewhat larger one for the second (0.8 vs 0.61), a large penalty for the third (1.2 vs 9.2), and a deletion penalty for the unmatched last element (3.89).
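A minimal sketch of this generalization (the function name and the fixed `gap` penalty are my own assumptions, not an established API): the standard Levenshtein dynamic program, with the substitution cost replaced by the absolute difference between the two values, so larger value differences are penalized more.

```python
def float_levenshtein(a, b, gap=1.0):
    """Weighted edit distance for sequences of floats.

    Substitution cost is the absolute difference between the two values;
    insertions and deletions cost a fixed `gap` penalty (an assumed
    parameter you would tune to your data's scale).
    """
    m, n = len(a), len(b)
    # dp[i][j] = cost of aligning a[:i] with b[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap
    for j in range(1, n + 1):
        dp[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + gap,                            # delete a[i-1]
                dp[i][j - 1] + gap,                            # insert b[j-1]
                dp[i - 1][j - 1] + abs(a[i - 1] - b[j - 1]),   # substitute
            )
    return dp[m][n]
```

One consequence of this design: a substitution is never chosen when the value difference exceeds `2 * gap`, because a delete plus an insert is cheaper, so `gap` effectively caps the per-element penalty. For the example pair above with `gap=1.0` this gives a distance of about 3.21.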


Solution

  • I don't think Levenshtein distance is well suited for this: its computational cost is considerable compared to simple metrics such as pointwise arithmetic difference or Euclidean distance.

    The real problem in your question seems to be defining a similarity function that combines difference in content with difference in length (of a time-series segment).

    In any case, you would do better to ask under the signal-processing and information-theory tags, since there is almost certainly an established metric or similarity function for your case. Levenshtein's "edit distance" is inherently suited to discrete alphabets and NLP; for numeric series I would look at information-theoretic measures instead. A cross-correlation might be what you are searching for.
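    To illustrate the cross-correlation suggestion, here is a sketch using NumPy's `correlate` on the segments from the question (the variable names are my own; in practice you would typically mean-subtract and normalize the segments first):

    ```python
    import numpy as np

    a = np.array([0.22, 0.8, 1.2, 3.89])
    b = np.array([0.2, 0.61, 9.2])

    # 'full' mode slides b across a over every possible lag,
    # producing len(a) + len(b) - 1 correlation values.
    xcorr = np.correlate(a, b, mode="full")

    # Lag at which the two segments line up best (peak correlation).
    best_lag = int(xcorr.argmax()) - (len(b) - 1)
    ```

    The peak of `xcorr` tells you the alignment offset; the peak's magnitude (after normalization) can serve as a similarity score, though unlike an edit distance it does not directly penalize the length mismatch.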