Below are two scripts that do the same thing, one in R and one in Python. Both produce the same k-means scatter plot in PCA space, but the bar charts of the cluster centers at the end are completely different. I believe something is wrong with the k-means cluster calculation in Python. The original code was written in R, and I am trying to work out why the Python bar chart does not match it; I suspect the cluster centers. Please review and provide some feedback.
Please use the link below to download the data set I used to generate these graphs.
https://www.dropbox.com/s/fhnxxrjl07y0h2c/TableStats2.csv?dl=0
R Code
## Retrieve libraries needed for script
library("ggplot2")
library("reshape2")
pcp <- read.csv(file='E:\\ProgramData\\R\\Code\\TableStats2.csv')
#Label each row with table Name to Plot names on chart.
data <- pcp
rownames(data) <- data[, 1]
#Gather all the data and leave out Table Names
data <- data[, -1]
data <- data[, -1]
#Create the PCA (Principal Component Analysis)
data <- scale(data)
pca <- prcomp(data)
plot.data <- data.frame(pca$x[, 1:2])
set.seed(2121)
clusters <- kmeans(data, 6)
plot.data$clusters <- factor(clusters$cluster)
g <- ggplot(plot.data, aes(x = PC1, y = PC2, colour = clusters)) +
geom_point(size = 3.5) +
geom_text(label = rownames(data), colour = "darkgrey", hjust = .7) +
theme_bw()
behaviours <- data.frame(clusters$centers)
behaviours$cluster <- 1:6
behavious <- melt(behaviours, "cluster")
g2 <- ggplot(behavious, aes(x = variable, y = value)) +
geom_bar(stat = "identity", position = 'identity', fill = "steelblue") +
facet_wrap(~cluster) +
theme_grey() +
theme(axis.text.x = element_text(angle = 90))
Python code
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
import seaborn as sns
from plotnine import ggplot, aes, geom_line, geom_bar, facet_wrap, theme_grey, theme, element_text
TableStats = pd.read_csv(r'E:\ProgramData\R\Code\TableStats2.csv')
sc = StandardScaler()
pca = PCA()
tables = TableStats.iloc[:,0]
y = tables
features = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)','Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']
# Separating out the features
x = TableStats.loc[:, features].values
x = sc.fit_transform(x)
dpca = pca.fit_transform(x)
x1 = dpca[:,0]
y1 = dpca[:,1]
plt.figure(figsize=(20,11))
plot = plt.scatter(x1,y1, c=y.index.tolist())
for i, label in enumerate(y):
    #print(label)
    plt.annotate(label,(x1[i], y1[i]))
plt.show()
df = pd.DataFrame(dpca,columns = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)','Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)'])
clusters = KMeans(n_clusters=6,init='k-means++', random_state=2121).fit(df)
df['Cluster'] = clusters.labels_
df['Cluster Centroid D1'] = df['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0])
df['Cluster Centroid D2'] = df['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1])
df['tables'] = tables
#print Table Names
plt.figure(figsize=(20, 11))
ax = sns.scatterplot(data=df, x=x1, y=y1, hue='Cluster', s=200, palette='coolwarm', legend=True)
ax = sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', s=1000, palette='coolwarm', legend=False, alpha=0.1)
for line in range(0,df.shape[0]):
    ax.text(x1[line]+0.05, y1[line],TableStats['Object Name'][line], horizontalalignment='left',size='medium', color='black',weight='semibold')
plt.legend(loc='upper right', title='Cluster')
ax.set_title("Clustered Points", fontsize='xx-large', y=1.05);
plt.show()
# Here is where the R and Python graphs differ, because the cluster centers don't match
behaviours = pd.DataFrame(clusters.cluster_centers_)
behaviours.columns = clusters.feature_names_in_
behaviours['cluster'] = [1,2,3,4,5,6]
b2 = pd.melt(behaviours, id_vars = "cluster",value_name="value")
(ggplot(b2, aes(x = 'variable', y = 'value')) +
geom_bar(stat = "identity", position = 'identity', fill = "steelblue") +
facet_wrap('~cluster') +
theme_grey() +
theme(axis_text_x = element_text(rotation = 90, hjust=1), figure_size=(20,8))
)
Update: I now have this working in R and Python.
Looking at this specific problem, check the outputs of the PCA: they're different, so the k-means results won't be the same. The reason is in your R code: you repeat the line data <- data[, -1], which drops the table names and then the first column of the data. Remove the extra line, and the clusters look the same.
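As a rough illustration of why the PCA outputs diverge, here is a minimal sketch that compares the first two principal component scores with and without the first feature column. It uses synthetic data from make_blobs as a stand-in for TableStats2.csv, and the helper name first_two_pcs is made up for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the table statistics (assumption): 50 rows, 7 features.
X, _ = make_blobs(n_samples=50, n_features=7, centers=6, random_state=0)


def first_two_pcs(features):
    """Scale the features and return the first two principal component scores."""
    scaled = StandardScaler().fit_transform(features)
    return PCA(n_components=2).fit_transform(scaled)


scores_all = first_two_pcs(X)             # all 7 features, as in the Python script
scores_dropped = first_two_pcs(X[:, 1:])  # first feature lost, as in the R script

# The sign of each principal component is arbitrary, so compare absolute values.
same_projection = np.allclose(np.abs(scores_all), np.abs(scores_dropped))
print("PCA scores match:", same_projection)
```

Running the same comparison on the real PCA outputs from R and Python is a quick way to confirm that both scripts are working from the same columns before looking at k-means at all.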
General comments on R and Python implementation of kmeans
In general, it looks like R and Python use different algorithms by default. R uses "Hartigan-Wong" by default, and Python's scikit-learn probably uses "elkan" (for dense data in releases before 1.1). Set algorithm='Lloyd' in R and algorithm='full' in Python (which I believe currently will run Lloyd's algorithm as well; scikit-learn 1.1 renamed this option to 'lloyd' and made it the default) to ensure they're at least attempting the same thing.
You also have different initialisation methods: R picks random starting centers, and for Python you are using 'k-means++'. Set init='random' in Python to make these match.
They have different maximum numbers of iterations: R defaults to 10, Python to 300. Set these to the same value as well.
Finally, you won't see any random variation in your Python script if you set random_state in the Python KMeans call (and check you haven't called set.seed in R either).
Once you've done this, try running both multiple times, and compare the distributions of values. Hopefully you'll see overlap between the two implementations.
Check out the docs for the R implementation and the scikit-learn implementation.
And a final point here - kmeans is unsupervised. The class labels have no absolute meaning. Run the code multiple times, and class 0 will not always be assigned to the same data points, even if data points are grouped identically.
Here's a reproducible example of this:
import pandas as pd
from sklearn import cluster, datasets
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
X, y = datasets.make_blobs(100,2,centers=6)
df = pd.DataFrame(X)
random_states = list(range(0,60,10))
fig, ax = plt.subplots(3,2, figsize=(20,16))
for i, r in enumerate(random_states):
    clusters = KMeans(n_clusters=6,init='k-means++', random_state=r).fit(X)
    df = (df
          .assign(**{
              'Cluster': clusters.labels_,
              'Cluster Centroid D1': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0]),
              'Cluster Centroid D2': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1]),
          })
         )
    row = i//2
    col = i - row*2
    sns.scatterplot(data=df, x=0, y=1, hue='Cluster', s=200, palette='coolwarm', legend=True, ax=ax[row,col])
    sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', s=1000,
                    palette='coolwarm', legend=False, alpha=0.1, ax=ax[row,col])
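If you want a number rather than a plot, one option (not part of the code above) is scikit-learn's adjusted_rand_score, which compares two groupings while ignoring how the clusters are numbered. The sketch below builds a second labelling by renumbering the first, so the grouping is identical by construction; the relabel array is a made-up permutation for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data (assumption).
X, _ = make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Renumber the clusters: 0 -> 2, 1 -> 0, 2 -> 1. The grouping itself is unchanged.
relabel = np.array([2, 0, 1])
labels_b = relabel[labels_a]

# Naive comparison: fraction of points that carry the same label number.
raw_agreement = float((labels_a == labels_b).mean())

# Label-invariant comparison: 1.0 means the two groupings are identical.
ari = adjusted_rand_score(labels_a, labels_b)

print("raw label agreement:", raw_agreement)
print("adjusted Rand index:", ari)
```

The same idea applies to the bar charts: cluster 1 in R need not be cluster 1 in Python, so match the facets by their center values rather than by their cluster number.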
Here's a version with your data:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
TableStats = pd.read_csv('TableStats2.csv')
sc = StandardScaler()
pca = PCA()
tables = TableStats.iloc[:,0]
y = tables
features = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)',
'Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']
# Separating out the features
x = TableStats.loc[:, features].values
x = sc.fit_transform(x)
dpca = pca.fit_transform(x)
x1 = dpca[:,0]
y1 = dpca[:,1]
random_states = [1,2,3,4,5,6]
for r in random_states:
    df = pd.DataFrame(dpca,columns = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)',
                                      'Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)'])
    clusters = KMeans(n_clusters=6,init='k-means++', random_state=r).fit(df)
    df = (df
          .assign(**{
              'Cluster': clusters.labels_,
              'Cluster Centroid D1': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0]),
              'Cluster Centroid D2': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1]),
          })
         )
    plt.figure(figsize=(20, 11))
    ax = sns.scatterplot(data=df, x=x1, y=y1, hue='Cluster', s=200, palette='coolwarm', legend=True)
    ax = sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster',
                         s=1000, palette='coolwarm', legend=False, alpha=0.1)
    plt.legend(loc='upper right', title='Cluster')
    ax.set_title("Clustered Points", fontsize='xx-large', y=1.05);
    plt.show()