python dataframe matrix scipy hierarchical-clustering

Clustering data using scipy and a distance matrix in Python


I am working in Python with a binary dataframe that holds a set of 0 and 1 values for different users at different times.

I can perform hierarchical clustering directly from the dataframe like this:

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    metodo = 'average'
    # linkage computes the Hamming distances between the rows of user_df itself
    clusters = linkage(user_df, method=metodo, metric='hamming')

    # Create a dendrogram
    plt.figure(figsize=(10, 7))
    dendrogram(clusters, labels=user_df.index.tolist(), leaf_rotation=90)
    plt.title('Hierarchical Clustering Dendrogram')
    plt.xlabel('User')
    plt.ylabel('Distance')
    # Save the figure
    plt.savefig(f'dendrogram_{metodo}_entero.png')
    plt.show()

However, I want to separate the calculation of the distance matrix from the clustering. To do that, I calculated the distance matrix first and passed it to the clustering function:

    dist_matrix = pdist(user_df.values, metric='hamming')

    # Convert the distance matrix to a square form
    dist_matrix_square = squareform(dist_matrix)

    # Create a DataFrame from the distance matrix
    dist_df = pd.DataFrame(dist_matrix_square, index=user_df.index, columns=user_df.index)

    clusters = linkage(dist_df, method=metodo)

Unfortunately, the two approaches give different results. As far as I know, the first one is correct.

So I don't know whether I can calculate the distance matrix separately and then pass it to the clustering step.


Solution

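  • pdist returns a NumPy array holding the condensed distance matrix. You can pass this form of the distance matrix directly to linkage. Don't convert it to a square matrix or a pandas DataFrame: linkage interprets any 2-D input as a table of observations (one row per user) and computes new distances between those rows, Euclidean by default, which is why your results differ.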

    So your code could be as simple as:

    dist_matrix = pdist(user_df.values, metric='hamming')
    clusters = linkage(dist_matrix, method=metodo)
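
To check that the two routes agree, here is a minimal sketch. The small random `user_df` is a hypothetical stand-in for your data (6 users, 8 time points), so only the comparison itself matters. It also shows that if all you have is a square distance matrix, `squareform` should turn it back into the condensed form that `linkage` expects.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for user_df: 6 users x 8 time points of 0/1 values
rng = np.random.default_rng(0)
user_df = pd.DataFrame(
    rng.integers(0, 2, size=(6, 8)),
    index=[f"user_{i}" for i in range(6)],
)
metodo = "average"

# Route 1: linkage computes the Hamming distances itself
direct = linkage(user_df.values, method=metodo, metric="hamming")

# Route 2: precompute the condensed distance vector, then cluster
condensed = pdist(user_df.values, metric="hamming")
precomputed = linkage(condensed, method=metodo)

# Route 3: starting from a square matrix, convert back to condensed form first
square = squareform(condensed)
from_square = linkage(squareform(square), method=metodo)

# Compare the linkage matrices produced by the three routes
same_direct = np.allclose(direct, precomputed)
same_square = np.allclose(precomputed, from_square)
print(same_direct, same_square)
```

The square matrix (or a DataFrame built from it) is still useful for inspecting or saving the distances with user labels; it just should not be what you hand to linkage.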