python nan correlation pearson-correlation pdist

scipy.pdist() returns NaN values


I'm trying to cluster time series. The elements within a cluster have the same shape but different scales, so I would like to use a correlation measure as the clustering metric. I'm trying the correlation or Pearson coefficient distance (any suggestion or alternative is welcome). However, the following code raises an error when I run Z = linkage(dist) because there are some NaN values in dist. There are no NaN values in time_series, as confirmed by

np.any(np.isnan(time_series))

which returns False

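import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import dendrogram, linkage

# time_series: 2-D array, one time series per row
dist = pdist(time_series, metric='correlation')  # condensed distance matrix
Z = linkage(dist)  # fails here when dist contains NaN
fig = plt.figure()
dn = dendrogram(Z)
plt.show()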

As an alternative, I tried the Pearson distance

from scipy.stats import pearsonr

def pearson_distance(a,b):
    return 1 - pearsonr(a,b)[0]

dist = pdist(time_series, pearson_distance)

but this generates some runtime warnings and takes a lot of time.


Solution

  • scipy.pdist(time_series, metric='correlation')
    

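    If you take a look at the manual, the correlation distance divides by the norms of the mean-centered vectors, ||u - mean(u)|| * ||v - mean(v)||. So it could be that one of your time series is constant (all of its values are the same): its centered vector is all zeros, and dividing zero by zero gives NaN.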
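
A minimal sketch of how this could be checked, assuming time_series is a 2-D NumPy array with one series per row. The toy data and the names constant_rows and filtered are made up for illustration; they are not from the question. Dropping the constant rows is only one possible workaround, and whether it is appropriate depends on your data.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical toy data: one series per row; the last row is constant.
time_series = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [5.0, 5.0, 5.0, 5.0],
])

# The input has no NaN, but the constant row has zero variance,
# so every distance involving it is computed as 0 / 0.
dist = pdist(time_series, metric='correlation')
has_nan = np.isnan(dist).any()

# A row is constant when its peak-to-peak range (max - min) is zero.
constant_rows = np.ptp(time_series, axis=1) == 0

# One possible workaround (an assumption, not the only option):
# drop the constant rows before computing the distances.
filtered = time_series[~constant_rows]
dist_filtered = pdist(filtered, metric='correlation')

print("NaN in dist:", has_nan)
print("constant rows:", np.flatnonzero(constant_rows))
print("NaN after filtering:", np.isnan(dist_filtered).any())
```

If constant_rows flags nothing on your real data, the NaN values have a different cause and this check rules out the zero-variance case.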