Tags: python, scikit-learn, cluster-analysis, data-science, correlation

Clustering data with Python based on their correlation


I would like to split the following data set into two clusters, one for each line ("\" and "/") of the "X". I thought this could be done by using the Pearson correlation coefficient as the distance metric in scikit-learn's agglomerative clustering, as described in "How to use Pearson Correlation as distance metric in Scikit-learn Agglomerative clustering", but it doesn't seem to work.

Plot of the raw data


Data:
-6.5955882 11.344538
-6.1911765 12.027311
-5.4191176 10.346639
-4.7573529 7.5105042
-2.9191176 7.7205882
-1.5955882 6.6176471
-2.9558824 6.039916
-1.1544118 3.9915966
-0.088235294 4.7794118
-0.088235294 2.8361345
0.53676471 -1.2079832
2.7794118 0
3.4044118 -4.3592437
5.2794118 -3.9915966
6.75 -8.5609244
7.4485294 -6.8802521
5.1691176 -5.7247899
-7.1470588 -2.8361345
-6.7058824 -1.2605042
-4.4264706 -1.1554622
-3.5073529 0.78781513
-0.86029412 0.31512605
-1.0808824 2.1533613
-2.8823529 -0.42016807
1.0514706 2.2584034
1.9338235 4.4117647
4.6544118 5.5147059
3.7352941 7.0378151
6.0147059 8.2457983
7.0808824 7.7205882

The code I've tried:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.stats import pearsonr

nc = 2  # number of clusters
data = np.loadtxt("cross-data_2.dat")  # one "x y" pair per row
plt.scatter(data[:, 0], data[:, 1], s=100)

def pearson_affinity(M):
    # Pairwise distance 1 - r between the rows of M
    return 1 - np.array([[pearsonr(a, b)[0] for a in M] for b in M])

# Note: scikit-learn 1.2 renamed the `affinity` parameter to `metric`
hc = AgglomerativeClustering(n_clusters=nc, affinity=pearson_affinity, linkage='average')
y_hc = hc.fit_predict(data)

# Plot each cluster in its own colour
plt.figure()
plt.scatter(data[y_hc == 0, 0], data[y_hc == 0, 1], s=100, c='red')
plt.scatter(data[y_hc == 1, 0], data[y_hc == 1, 1], s=100, c='black')

plt.show()

The results of the clustering:

Plot of the clustered data

Is there something wrong in the code or should I simply use another method?


Solution

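  • I propose yet another method for this: a Gaussian Mixture Model. Each arm of the "X" is roughly an elongated blob, so a two-component mixture with full covariance matrices (the scikit-learn default) can fit one component to each arm.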

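    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.mixture import GaussianMixture

    # Same file as in the question: one "x y" pair per row
    X = np.loadtxt("cross-data_2.dat")

    # Two Gaussian components, one per arm of the "X";
    # n_init=5 keeps the best of five random initialisations
    gmm = GaussianMixture(n_components=2,
                          init_params='random',
                          n_init=5,
                          random_state=123)
    y_pred = gmm.fit_predict(X)
    plt.scatter(*X.T, c=y_pred)
    plt.show()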
    

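As for why the Pearson-based attempt in the question does not separate the two arms, one likely explanation is that pearson_affinity correlates the rows of the data, and each row holds only two numbers (x, y). The Pearson coefficient of any two 2-element vectors is either +1 or -1, so the distance 1 - r is always 0 or 2 and only records whether two points lie on the same side of the line y = x. The sketch below checks this on six rows copied from the question's data; the names points, side and expected are illustrative and not part of the original code, and np.corrcoef is used here as a stand-in for the pairwise pearsonr calls.

```python
import numpy as np

# Six (x, y) rows copied from the question's data set
points = np.array([
    [-6.5955882, 11.344538],    # "\" arm, upper left
    [-1.1544118, 3.9915966],    # "\" arm
    [6.75, -8.5609244],         # "\" arm, lower right
    [-7.1470588, -2.8361345],   # "/" arm, lower left
    [-0.86029412, 0.31512605],  # "/" arm
    [4.6544118, 5.5147059],     # "/" arm, upper right
])

# np.corrcoef treats each row as one variable, which matches calling
# pearsonr(a, b) on every pair of rows as pearson_affinity does
corr = np.corrcoef(points)
dist = 1 - corr

# With only two values per row, every correlation is +1 or -1,
# decided by whether the two points lie on the same side of y = x
side = np.sign(points[:, 1] - points[:, 0])
expected = np.outer(side, side)

print(np.round(dist, 6))
print(np.allclose(corr, expected))
```

Under this reading, the custom metric carries no information about which arm a point belongs to, which is why a model that fits the shape of each arm, such as the mixture above, is a better match for this data.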