I'm trying to cluster time series. Series within a cluster have the same shape but different scales, so I would like to use a correlation-based measure as the clustering metric. I'm trying the correlation (Pearson coefficient) distance; any suggestion or alternative is welcome. However, the following code raises an error when I run Z = linkage(dist), because dist contains some NaN values. There are no NaN values in time_series, which is confirmed by
np.any(np.isnan(time_series))
which returns False
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import dendrogram, linkage
dist = pdist(time_series, metric='correlation')
Z = linkage(dist)
fig = plt.figure()
dn = dendrogram(Z)
plt.show()
As an alternative, I tried the Pearson distance:
from scipy.stats import pearsonr
def pearson_distance(a, b):
    return 1 - pearsonr(a, b)[0]
dist = pdist(time_series, pearson_distance)
but this generates some runtime warnings and takes a lot of time.
scipy.spatial.distance.pdist(time_series, metric='correlation')
If you take a look at the manual, the correlation option divides by the norms of the mean-centered vectors. So it could be that one of your time series is constant (all of its values are the same): its centered vector is then all zeros, and dividing zero by zero gives NaN.
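A minimal sketch of how one might check for zero-variance rows before calling pdist. The helper name constant_rows and the toy array demo are hypothetical, made up for illustration; the real time_series may need a tolerance instead of an exact comparison if some rows are only nearly constant.

```python
import numpy as np
from scipy.spatial.distance import pdist


def constant_rows(time_series):
    """Return a boolean mask marking rows whose values are all identical."""
    time_series = np.asarray(time_series, dtype=float)
    # A row is constant exactly when its maximum equals its minimum.
    return time_series.max(axis=1) == time_series.min(axis=1)


# Hypothetical toy data: the second series is flat (zero variance).
demo = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 5.0, 5.0, 5.0],
    [2.0, 4.0, 6.0, 8.0],
])

mask = constant_rows(demo)
print("constant rows:", np.flatnonzero(mask))

# Distances including the flat row; silence the 0/0 runtime warnings.
with np.errstate(invalid="ignore", divide="ignore"):
    dist_all = pdist(demo, metric="correlation")
print("any NaN with flat row kept:", np.isnan(dist_all).any())

# Distances after dropping the flat row.
dist_clean = pdist(demo[~mask], metric="correlation")
print("any NaN after dropping it:", np.isnan(dist_clean).any())
```

Dropping the flat rows is only one option. Whether to discard them, assign them to their own cluster, or give them a fixed distance depends on what a constant series means in the data.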