Tags: python, numpy, time-series, data-analysis, similarity

What is the correct way to format the parameters for DTW in Similarity Measures?


I am trying to use the DTW algorithm from the Similarity Measures library. However, I get hit with an error that states a 2-Dimensional Array is required. I am not sure I understand how to properly format the data, and the documentation is leaving me scratching my head.

https://github.com/cjekel/similarity_measures/blob/master/docs/similaritymeasures.html

According to the documentation, the function takes two arguments (exp_data and num_data) for the data sets, which makes sense. What doesn't make sense to me is:

exp_data : array_like

Curve from your experimental data. exp_data is of (M, N) shape, where M is the number of data points, and N is the number of dimensions

This is the same for both the exp_data and num_data arguments.

So, for further clarification, let's say I am using the fastdtw library. It looks like this:

import numpy as np
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

x = np.array([1, 2, 3, 3, 7])
y = np.array([1, 2, 2, 2, 2, 2, 2, 4])

distance, path = fastdtw(x, y, dist=euclidean)

print(distance)
print(path)

Or I can implement the same code with dtaidistance:

from dtaidistance import dtw

x = [1, 2, 3, 3, 7]
y = [1, 2, 2, 2, 2, 2, 2, 4]

distance = dtw.distance(x, y)

print(distance)

However, using this same code with Similarity Measures results in an error. For example:

import similaritymeasures
import numpy as np

x = np.array([1, 2, 3, 3, 7])
y = np.array([1, 2, 2, 2, 2, 2, 2, 4])

# This call raises the error: x and y are 1-D, but a 2-dimensional array is required
dtw, d = similaritymeasures.dtw(x, y)

print(dtw)
print(d)

So, my question is: why is a 2-dimensional array required here? What is Similarity Measures doing that the other libraries are not?

And if Similarity Measures requires data of shape (M, N), where M is the number of data points and N is the number of dimensions, then where does my data go? Put differently: x in the examples above has 5 data points, so M = 5, and it has one dimension, so N = 1. Am I supposed to pass an array of shape (5, 1)? That doesn't seem right to me, but I can't find any sample code that makes this any clearer.

My reason for wanting to use similaritymeasures is that it has several other functions that I would like to use, such as the Fréchet distance and the Hausdorff distance. I'd really like to understand how to use it.

I really appreciate any help.


Solution

  • It appears the solution in my case was to include the index in the array. For example, if your data looks like this:

    x = [1, 2, 3, 3, 7]
    y = [1, 2, 2, 2, 2, 2, 2, 4]
    

    It needs to look like this:

    x = [[1, 1], [2, 2], [3, 3], [4, 3], [5, 7]]
    y = [[1, 1], [2, 2], [3, 2], [4, 2], [5, 2], [6, 2], [7, 2], [8, 4]]
    

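    In my case, x and y were two separate columns in a pandas DataFrame. My solution was as follows:

    import numpy as np
    import similaritymeasures

    # df is an existing pandas DataFrame with columns 'column1' and 'column2'
    df['index'] = df.index  # use the row index as the first coordinate

    # Build P with shape (M, 2): one row per data point, columns (index, value)
    x1 = df['index']
    y1 = df['column1']
    P = np.array([x1, y1]).T

    # Build Q the same way from the second column
    x2 = df['index']
    y2 = df['column2']
    Q = np.array([x2, y2]).T

    dtw, d = similaritymeasures.dtw(P, Q)

    print(dtw)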
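
To make the (M, N) layout concrete, here is a minimal sketch that builds the same arrays as the hand-written lists above using only NumPy. The helpers as_curve and dtw_sketch are hypothetical names introduced for illustration; dtw_sketch is a textbook DTW written for this example, not the library's implementation. One thing the sketch makes visible: adding the index column changes what is measured, because the distance between two points then includes their distance along the index axis. Assuming the library only needs 2-D input, a single-column (M, 1) array may be enough if you want the plain value-only DTW that fastdtw and dtaidistance compute.

```python
import numpy as np


def as_curve(values, with_index=True):
    """Turn a 1-D series into an (M, N) array with one row per data point.

    with_index=True  -> columns are (position, value), shape (M, 2)
    with_index=False -> a single value column, shape (M, 1)
    """
    values = np.asarray(values, dtype=float)
    if with_index:
        # Positions start at 1 to match the hand-written example above.
        return np.column_stack([np.arange(1, len(values) + 1), values])
    return values.reshape(-1, 1)


def dtw_sketch(p, q):
    """Textbook DTW on two (M, N) point arrays, Euclidean point distance.

    Illustrative only; hypothetical helper, not the library's code.
    """
    # Pairwise distance between every row of p and every row of q.
    cost = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=2)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return acc[n, m]


x = [1, 2, 3, 3, 7]
y = [1, 2, 2, 2, 2, 2, 2, 4]

P = as_curve(x)
Q = as_curve(y)
print(P.shape, Q.shape)  # (5, 2) (8, 2)

# Value-only curves: shape (M, 1), no index column.
values_only = dtw_sketch(as_curve(x, with_index=False),
                         as_curve(y, with_index=False))
# Index-augmented curves: the index axis contributes to every point distance.
with_index = dtw_sketch(P, Q)
print(values_only, with_index)
```

If the index column is kept, it is worth checking that its scale is comparable to the scale of the values, since both axes are weighted equally in a Euclidean point distance.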