Tags: python, pandas, k-means

Plotting results of k-means


I have a pandas dataframe of 229 rows. Each row represents a "strain". The data comes from 4 sites. The strains are encoded with the site codes.

Once upon a time, this data was clustered, and the results were described as follows:

The 229 strains examined formed a large group at the S (similarity) 231% level, using the Jaccard similarity coefficient and unweighted average linkage. Within this group, 10 clusters, or phena, were distinguished at varying levels of similarity above 65%. Twenty-one strains did not fall into any one of these phenetic groups. No cluster with less than five members was considered further.

Disclaimer: I am not a statistician; I know essentially nothing about statistics past mean and median. Way back then I had a statistician to work with. I also know next to nothing about machine learning algorithms, although I know what clustering means in general terms.

I want to try to reproduce the clustering with more modern methods. I thought I'd try k-means (if that's a bad choice, please enlighten me).

The data is Boolean. I have transposed it so that each column is a "strain" and the rows are the features. (Was that right?)

[Image: the data]

The code:

In [106]: from sklearn.cluster import KMeans

          kmeans = KMeans(n_clusters=10)
          kmeans.fit(df_bool)

Out [106]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=10, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [107]: labels = kmeans.predict(df_bool)
          centroids = kmeans.cluster_centers_

          labels

Out [107]: array([5, 5, 2, 2, 0, 4, 9, 8, 1, 6, 1, 1, 7, 1, 3, 1, 1, 1, 1, 1],
      dtype=int32)

Question 1: Is this list (of what I assume are cluster numbers) what I should be expecting?

Question 2: How might I plot some or all of the results?

Question 3: Am I totally off base? That is, does k-means make sense with Boolean data? Is my DataFrame aligned properly?

Am I even asking the right questions?


Solution

  • Question 1: Yes, the output you describe is what you should expect: one integer per observation, telling you which cluster that observation was assigned to.

    Question 2: You can't plot this directly, because each observation is a high-dimensional vector. What people generally do when they plot k-means results is apply some kind of dimensionality reduction to project their vectors into two dimensions, and then plot those as X and Y. You can then use the k-means labels as colors for the scatter plot. See "How to plot text clusters?", where I describe this process in greater detail.
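
    A minimal sketch of that approach, using PCA as the dimensionality reduction (one choice among several). The array `X` is a randomly generated, hypothetical stand-in for the real data, not the data from the question. Note that scikit-learn treats each row as one observation, so the sketch assumes strains are rows; the 20 labels in the question's output suggest that the transposed frame clustered the 20 features rather than the 229 strains, in which case passing `df_bool.T` would restore the expected orientation.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical stand-in for the real data:
# 229 strains (rows) x 20 Boolean features (columns).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(229, 20)).astype(float)

# Cluster in the full 20-dimensional feature space.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Project to two dimensions for display only; the clustering itself
# is not affected by this step.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
centers_2d = pca.transform(kmeans.cluster_centers_)

# Scatter plot of the projected strains, colored by cluster label,
# with the projected centroids marked as black crosses.
fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=20)
ax.scatter(centers_2d[:, 0], centers_2d[:, 1], c="black", marker="x", s=80)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_title("k-means labels on a 2-D PCA projection")
fig.savefig("kmeans_pca.png")
```

    Two components usually capture only part of the variance, so clusters that overlap in the plot may still be well separated in the original space.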

    Question 3: k-means clustering may not work very well with binary data. It minimizes squared Euclidean distances to cluster means, and the mean of a set of 0/1 vectors is generally not itself a 0/1 vector. See https://www.ibm.com/support/pages/clustering-binary-data-k-means-should-be-avoided for alternatives. It's mostly a matter of whether the data have underlying patterns that form coherent groupings, and whether the method you use can capture those.
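
    One alternative that stays close to the method quoted in the question (Jaccard coefficient with unweighted average linkage) is hierarchical clustering with SciPy. The following is only a sketch under stated assumptions: `X` is again a randomly generated, hypothetical stand-in with strains as rows, and the single cut at 65% similarity is illustrative, since the original study distinguished its clusters at varying levels above 65%.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical stand-in for the real data:
# 229 strains (rows) x 20 Boolean features (columns).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(229, 20)).astype(bool)

# Jaccard distance (1 - Jaccard similarity) for every pair of strains,
# returned as a condensed distance vector.
dist = pdist(X, metric="jaccard")

# method="average" is UPGMA, i.e. unweighted average linkage.
Z = linkage(dist, method="average")

# Cut the tree at distance 0.35 (65% similarity). This cutoff is an
# assumption for illustration, not a value taken from the original analysis.
labels = fcluster(Z, t=0.35, criterion="distance")

# Cluster sizes; small groups could be set aside, as the original
# study did for clusters with fewer than five members.
ids, counts = np.unique(labels, return_counts=True)
print(dict(zip(ids.tolist(), counts.tolist())))
```

    `scipy.cluster.hierarchy.dendrogram(Z)` draws the resulting tree, which is a natural way to plot this kind of result.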