K-Means R vs K-Means Python different cluster values generating different bar Graphs


Below are two scripts that do the same thing, one in Python and the other in R. Both produce the same k-means plot in PCA space, but the bar charts at the end, which are built from the cluster centers, are completely different. I believe something is wrong with the k-means cluster calculation in Python. The original code was written in R. I am trying to work out why the Python bar chart does not match the R one; I suspect the centers. Please review and provide some feedback.

Please use the link below to download the data set I used to generate these graphs.

https://www.dropbox.com/s/fhnxxrjl07y0h2c/TableStats2.csv?dl=0

R Code

## Retrieve libraries needed for script
library("ggplot2")
library("reshape2")
  

pcp <- read.csv(file='E:\\ProgramData\\R\\Code\\TableStats2.csv')

#Label each row with table Name to Plot names on chart.
data <- pcp
rownames(data) <- data[, 1]


#Gather all the data and leave out Table Names
data <- data[, -1]
data <- data[, -1]

#Create the PCA (Principal Component Analysis)
data <- scale(data)
pca <- prcomp(data)

plot.data <- data.frame(pca$x[, 1:2])

set.seed(2121)
clusters <- kmeans(data, 6)
plot.data$clusters <- factor(clusters$cluster)

g <- ggplot(plot.data, aes(x = PC1, y = PC2, colour = clusters)) +
  geom_point(size = 3.5) +
  geom_text(label = rownames(data), colour = "darkgrey", hjust = .7) +
  theme_bw()

behaviours <- data.frame(clusters$centers)
behaviours$cluster <- 1:6
behavious <- melt(behaviours, "cluster")

g2 <- ggplot(behavious, aes(x = variable, y = value)) +
  geom_bar(stat = "identity", position = 'identity', fill = "steelblue") +
  facet_wrap(~cluster) +
  theme_grey() +
  theme(axis.text.x = element_text(angle = 90)) 

Python code

import pandas as pd
import seaborn as sns  # needed for the sns.scatterplot calls further down
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
from plotnine import ggplot, aes, geom_line, geom_bar, facet_wrap, theme_grey, theme, element_text

TableStats = pd.read_csv(r'E:\ProgramData\R\Code\TableStats2.csv')

sc = StandardScaler()
pca = PCA()
tables = TableStats.iloc[:,0]
y = tables

features = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)','Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']
# Separating out the features
x = TableStats.loc[:, features].values

x = sc.fit_transform(x)
dpca = pca.fit_transform(x)
x1 = dpca[:,0]
y1 = dpca[:,1]
plt.figure(figsize=(20,11))
plot = plt.scatter(x1,y1, c=y.index.tolist())
for i, label in enumerate(y):
  #print(label)
  plt.annotate(label,(x1[i], y1[i]))
plt.show()

df = pd.DataFrame(dpca,columns = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)','Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']) 

clusters = KMeans(n_clusters=6,init='k-means++', random_state=2121).fit(df)

df['Cluster'] = clusters.labels_
df['Cluster Centroid D1'] = df['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0])
df['Cluster Centroid D2'] = df['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1])
df['tables'] = tables

#print Table Names
plt.figure(figsize=(20, 11))
ax = sns.scatterplot(data=df, x=x1, y=y1, hue='Cluster', s=200, palette='coolwarm', legend=True)
ax = sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', s=1000, palette='coolwarm', legend=False, alpha=0.1)
for line in range(0,df.shape[0]):
    ax.text(x1[line]+0.05, y1[line],TableStats['Object Name'][line], horizontalalignment='left',size='medium', color='black',weight='semibold')

plt.legend(loc='upper right', title='Cluster')
ax.set_title("Clustered Points", fontsize='xx-large', y=1.05);
plt.show()

# This is where the R and Python graphs differ, because the cluster centers don't match
behaviours = pd.DataFrame(clusters.cluster_centers_)
behaviours.columns = clusters.feature_names_in_
behaviours['cluster'] = [1,2,3,4,5,6]

b2 = pd.melt(behaviours, id_vars = "cluster",value_name="value")

(ggplot(b2, aes(x = 'variable', y = 'value')) + 
geom_bar(stat = "identity", position = 'identity', fill = "steelblue") + 
facet_wrap('~cluster') + 
theme_grey() + 
theme(axis_text_x = element_text(rotation = 90, hjust=1), figure_size=(20,8)) 
)

Solution

  • Update: I now have this working in R and Python.

    Looking at this specific problem, check the outputs of the PCA - they're different, so k-means won't be the same. The reason is in your R code - you repeat the line data <- data[, -1], dropping the table names and the first column of the data. Remove the extra line, and the clusters look the same.
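
    Separately from the duplicated line, the two scripts in the question cluster different inputs: the R script calls kmeans on the scaled data, while the Python script fits KMeans on the PCA scores (dpca) and only labels the columns with the feature names. The Python centers are therefore in principal-component coordinates, which would change the bar chart even when the cluster memberships agree. Below is a minimal sketch of mapping such centers back to scaled-feature coordinates. It uses synthetic data from make_blobs as a stand-in for the CSV, and the variable names are illustrative assumptions rather than anything from the original scripts.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the table statistics (hypothetical data, 7 features).
X, _ = make_blobs(n_samples=120, n_features=7, centers=6, random_state=0)

x_scaled = StandardScaler().fit_transform(X)
pca = PCA()
scores = pca.fit_transform(x_scaled)  # plays the role of dpca in the question

# Clustering the PCA scores gives centers in principal-component coordinates.
km = KMeans(n_clusters=6, n_init=10, random_state=2121).fit(scores)
centers_pc = km.cluster_centers_

# Rotate the centers back into scaled-feature coordinates, the space in which
# the R script's clusters$centers are expressed.
centers_scaled = pca.inverse_transform(centers_pc)

print(np.round(centers_pc[0], 2))      # first center, PC coordinates
print(np.round(centers_scaled[0], 2))  # same center, scaled-feature coordinates
```

    Fitting KMeans on the scaled array directly, as the R script does, avoids the conversion altogether.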


    General comments on R and Python implementation of kmeans

    In general, it looks like R and Python use different algorithms by default. R uses "Hartigan-Wong" by default. Older scikit-learn releases used "elkan" for dense data, and from version 1.1 onward the default is, as far as I know, "lloyd". Set algorithm='Lloyd' in R and algorithm='lloyd' in Python (algorithm='full' in scikit-learn releases before 1.1, where that was the name for Lloyd's algorithm) to ensure they're at least attempting the same thing.

    You also have different initialisation methods - R is random and for Python you are using 'k-means++'. Set init='random' in Python to make these match.

    They have different maximum numbers of iterations - R defaults to 10 (iter.max), Python to 300 (max_iter). Set these to the same value as well.

    Finally, you won't see any random variation in your Python script while random_state is fixed in the KMeans call, and the same holds in R while set.seed is called before kmeans. Remove or vary both if you want to compare the spread of results.

    Once you've done this, try running both multiple times, and compare the distributions of values. Hopefully you'll see overlap between the two implementations.

    Check out the docs for the R implementation and the scikit-learn implementation.

    And a final point here - kmeans is unsupervised. The class labels have no absolute meaning. Run the code multiple times, and class 0 will not always be assigned to the same data points, even if data points are grouped identically.

    Here's a reproducible example of this:

    import pandas as pd
    from sklearn import datasets
    
    from matplotlib import pyplot as plt
    import seaborn as sns
    from sklearn.cluster import KMeans
    
    # 100 points in 2 dimensions, drawn around 6 centers
    X, y = datasets.make_blobs(n_samples=100, n_features=2, centers=6)
    df = pd.DataFrame(X)
    
    random_states = list(range(0,60,10))
    fig, ax = plt.subplots(3,2, figsize=(20,16))
    for i, r in enumerate(random_states):
    
        clusters = KMeans(n_clusters=6,init='k-means++', random_state=r).fit(X)
    
        df = (df
          .assign(**{
              'Cluster': clusters.labels_,
              'Cluster Centroid D1': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0]),
              'Cluster Centroid D2': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1]),
          })
         )
        
        row = i//2
        col = i - row*2
        sns.scatterplot(data=df, x=0, y=1, hue='Cluster', s=200, palette='coolwarm', legend=True, ax=ax[row,col])
        sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', s=1000, 
                        palette='coolwarm', legend=False, alpha=0.1, ax=ax[row,col])

    # Render the 3x2 grid once all panels are drawn
    plt.show()
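
    Because the label numbers are arbitrary, comparing labels element by element can be misleading. A minimal sketch of a numbering-independent comparison, using scikit-learn's adjusted_rand_score on synthetic data (the variable names are illustrative):

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, _ = datasets.make_blobs(n_samples=100, n_features=2, centers=6, random_state=0)

labels_a = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
labels_b = KMeans(n_clusters=6, n_init=10, random_state=10).fit_predict(X)

# Element-wise agreement depends on how the clusters happen to be numbered.
raw_agreement = np.mean(labels_a == labels_b)
# The adjusted Rand index only looks at which points are grouped together.
ari = adjusted_rand_score(labels_a, labels_b)
print(raw_agreement, ari)

# Renumbering every cluster changes each label but not the grouping.
relabelled = (labels_a + 1) % 6
raw_relabelled = np.mean(labels_a == relabelled)
ari_relabelled = adjusted_rand_score(labels_a, relabelled)
print(raw_relabelled, ari_relabelled)
```

    The same idea applies when comparing R and Python results: export the cluster assignments from both and compare the groupings rather than the label numbers.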
    

    Here's a version with your data:

    import pandas as pd
    from matplotlib import pyplot as plt
    import seaborn as sns
    
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    
    TableStats = pd.read_csv('TableStats2.csv')
    
    sc = StandardScaler()
    pca = PCA()
    tables = TableStats.iloc[:,0]
    y = tables
    
    features = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)',
                'Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']
    
    # Separating out the features
    x = TableStats.loc[:, features].values
    
    x = sc.fit_transform(x)
    dpca = pca.fit_transform(x)
    x1 = dpca[:,0]
    y1 = dpca[:,1]
    
    random_states = [1,2,3,4,5,6]
    for r in random_states:
        df = pd.DataFrame(dpca,columns = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)',
                                          'Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']) 
        clusters = KMeans(n_clusters=6,init='k-means++', random_state=r).fit(df)
    
        df = (df
              .assign(**{
                  'Cluster': clusters.labels_,
                  'Cluster Centroid D1': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0]),
                  'Cluster Centroid D2': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1]),
              })
             )
        
        plt.figure(figsize=(20, 11))
        ax = sns.scatterplot(data=df, x=x1, y=y1, hue='Cluster', s=200, palette='coolwarm', legend=True)
        ax = sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', 
                             s=1000, palette='coolwarm', legend=False, alpha=0.1)    
    
        plt.legend(loc='upper right', title='Cluster')
        ax.set_title("Clustered Points", fontsize='xx-large', y=1.05);
        plt.show()