python pandas import k-means sklearn-pandas

Using SKLearn KMeans With Externally Generated Correlation Matrix


I receive a correlation file from an external source. It is a fairly straightforward file and looks like the following.

[image of the correlation matrix file]

A sample CSV can be found here:

https://www.dropbox.com/scl/fi/1ytmnk23zb70twns2owsi/corrmatrix.csv?rlkey=ev6ya520bc0n94yfqswasi3o6&st=p4vntit1&dl=0

I want to use this file to do some k-means clustering, and I am using the following code:

import pandas as pd
correlation_mat=pd.read_csv("C:/temp/corrmatrix.csv",index_col=False)

from sklearn.cluster import KMeans

# Utility function to print the name of companies with their assigned cluster
def print_clusters(df_combined,cluster_labels):
  cluster_dict = {}
  for i, label in enumerate(cluster_labels):
      if label not in cluster_dict:
          cluster_dict[label] = []
      cluster_dict[label].append(df_combined.columns[i])

  # Print out the companies in each cluster
  for cluster, companies in cluster_dict.items():
      print(f"Cluster {cluster}: {', '.join(companies)}")

# Perform k-means clustering with four clusters
clustering = KMeans(n_clusters=4, random_state=0).fit(correlation_mat)

# Print the cluster labels
cluster_labels=clustering.labels_
print_clusters(correlation_mat,cluster_labels)

Even though this file looks like a correlation file as generated by Pandas, I cannot get it to work.

I keep getting the following error:

ValueError: could not convert string to float: 'ABBV'

How can I get this file to work with SKLearn? I merely receive the data from a third party, so regenerating the correlations myself is not an option.

Is there a way to have SKLearn see this as it would see a Pandas generated correlation file?

I would very much appreciate any help.


Solution

  • The error comes from how the CSV is read. Passing index_col=False tells pandas not to use any column as the index, so the first column, which holds the ticker names, stays in correlation_mat as ordinary string data. KMeans then fails when it tries to convert 'ABBV' to a float. In a correlation matrix the first column holds the row labels and matches the header, so it should be the index. The only change needed is to read the first column as the index:

    correlation_mat=pd.read_csv("C:/temp/corrmatrix.csv",index_col=0)
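    The difference between the two index_col settings can be checked without the original file. The following is only a minimal sketch: the CSV text is a made-up four-ticker stand-in for the third-party file (the values and the blank first header cell are assumptions), and n_clusters=2 is chosen only to suit that toy data.

```python
import io

import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the third-party file: tickers in the header
# and repeated in the first column. The correlation values are made up.
csv_text = """,ABBV,ABT,ACN,ADBE
ABBV,1.0,0.8,0.1,0.2
ABT,0.8,1.0,0.2,0.1
ACN,0.1,0.2,1.0,0.9
ADBE,0.2,0.1,0.9,1.0
"""

# index_col=False: the ticker column stays in the data as strings,
# which is what KMeans later fails to convert to float.
bad = pd.read_csv(io.StringIO(csv_text), index_col=False)
first_col_is_numeric = pd.api.types.is_numeric_dtype(bad.iloc[:, 0])

# index_col=0: the tickers become row labels and only numbers remain.
good = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Sanity checks before clustering: square, and labelled the same both ways.
is_square = good.shape[0] == good.shape[1]
labels_match = list(good.index) == list(good.columns)

# n_init is set explicitly so the result does not depend on the
# default, which differs between scikit-learn versions.
clustering = KMeans(n_clusters=2, n_init=10, random_state=0).fit(good)

# Group tickers by their assigned cluster label.
clusters = {}
for ticker, label in zip(good.index, clustering.labels_):
    clusters.setdefault(int(label), []).append(ticker)

for label, tickers in sorted(clusters.items()):
    print(f"Cluster {label}: {', '.join(tickers)}")
```

    Running the same checks on the real file (is_square and labels_match) is a quick way to confirm that it really has the layout of a pandas-generated correlation matrix before fitting.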