I'm doing a project analyzing page visits to an e-commerce website. The data contains continuous numerical variables, discrete numerical variables (numbers that only take integer values), and categorical variables.
My understanding is that because KMeans computes means and Euclidean distances on the values, it does not work very well with categorical variables. I also don't think it works well with discrete numerical values, because the centroids it computes will be fractional even though fractions of these discrete values are not meaningful.
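To make the concern about discrete values concrete, here is a minimal sketch of a single centroid update on an integer-valued feature. The data and the cluster assignment are made up by hand for illustration; only NumPy is assumed.

```python
import numpy as np

# Hypothetical integer-valued feature, e.g. pages viewed per session.
page_counts = np.array([1, 2, 2, 4, 9, 10, 12])

# Assume a k-means step assigned the first four sessions to cluster 0
# and the last three to cluster 1 (hand-picked assignment, not fitted).
labels = np.array([0, 0, 0, 0, 1, 1, 1])

# The centroid update is the per-cluster mean, so it can be fractional
# even though every observation is an integer.
centroids = np.array([page_counts[labels == c].mean() for c in (0, 1)])
print(centroids)
```

The centroids come out as non-integers, but the cluster assignments themselves are still made on the original integer observations.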
Here is how I run sklearn's KMeans. I compute the silhouette score for each k in a range and use the k with the highest score. I create a dataframe called cluster_df containing only the numerical features from my original dataframe, and then separate dataframes for each cluster:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# standardize the numerical features so they are on a comparable scale
scaler = StandardScaler()
cluster_df[cluster_attribs] = scaler.fit_transform(cluster_df[cluster_attribs])

# silhouette score for each candidate k
k_rng = range(2, 10)
silhouette = []
for k in k_rng:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(cluster_df[cluster_attribs])
    silhouette.append(silhouette_score(cluster_df[cluster_attribs], kmeans.labels_))

# refit with k=3, the k that had the highest silhouette score
kmeans = KMeans(n_clusters=3)
y_pred = kmeans.fit_predict(cluster_df[cluster_attribs])
cluster_df['cluster'] = y_pred

# inverse StandardScaler to return values to their original scale
cluster_df[cluster_attribs] = scaler.inverse_transform(cluster_df[cluster_attribs])

# one dataframe per cluster
cluster0 = cluster_df[cluster_df.cluster==0]
cluster1 = cluster_df[cluster_df.cluster==1]
cluster2 = cluster_df[cluster_df.cluster==2]
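For reference, the step of picking the k with the highest silhouette score could be done programmatically rather than by reading the scores off and hard-coding the value. This is only a sketch on synthetic data: the array X stands in for the scaled numerical features, and n_init and random_state are set here purely to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled features: three tight, well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

k_rng = range(2, 10)
silhouette = []
for k in k_rng:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    silhouette.append(silhouette_score(X, labels))

# k_rng and silhouette are aligned, so the position of the best score gives the best k.
best_k = k_rng[int(np.argmax(silhouette))]
print(best_k)
```

On real data the best k will of course depend on the features; the synthetic blobs are only there so the sketch runs on its own.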
I then perform data visualizations/analysis based on these 3 clusters. The clustering seems to work pretty well, and the categorical variables also appear to differ meaningfully between clusters, even though they weren't included in the actual clustering.
For instance, Revenue is a binary column I didn't include in KMeans. Nevertheless, my 3 clusters seem to have separated my customers well into low-revenue, medium-revenue, and high-revenue groups, even though KMeans ran only on the numerical variables.
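The comparison of clusters against the held-out Revenue column can be sketched with a cross-tabulation. The data below is invented for illustration, with Revenue encoded as 0/1 as an assumption; in the real project the cluster column would come from KMeans.

```python
import pandas as pd

# Hypothetical data: cluster labels plus the held-out binary Revenue column.
df = pd.DataFrame({
    "cluster": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "Revenue": [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0],
})

# Share of sessions without (0) and with (1) revenue inside each cluster;
# normalize="index" makes every row sum to 1.
revenue_by_cluster = pd.crosstab(df["cluster"], df["Revenue"], normalize="index")
print(revenue_by_cluster)
```

A table like this shows how strongly a variable that was left out of the clustering still varies across clusters.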
My questions are:
1) Is it true that KMeans only works well with continuous numerical data, and not with discrete numerical or categorical data? (I've read there are ways to convert categorical variables to numerical ones, but they seemed complicated and not reliably accurate for this project. I know about OneHotEncoder/LabelEncoder/MultiLabelBinarizer, but I mean converting the categories while keeping their distances from each other in mind, which is more complicated.)
2) Is it an acceptable strategy to run KMeans on just the numerical data, separate the data into clusters, and then draw insights about all of the variables (numerical, discrete numerical, categorical) by seeing how they are distributed across the clusters?
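To show what the strategy in question 2 looks like in practice, here is a minimal sketch that profiles each cluster with a summary suited to each variable type. All column names except cluster are hypothetical, and the choice of mean, median, and mode is an assumption rather than a recommendation.

```python
import pandas as pd

# Hypothetical session data with one column of each variable type.
df = pd.DataFrame({
    "cluster":      [0, 0, 0, 1, 1, 1],
    "duration_sec": [12.5, 30.0, 20.5, 300.0, 250.0, 410.0],  # continuous
    "pages_viewed": [1, 2, 2, 9, 12, 10],                     # integer-valued
    "visitor_type": ["new", "new", "returning",
                     "returning", "returning", "new"],        # categorical
})

# One summary per variable type: mean for continuous, median for discrete,
# most frequent value for categorical.
profile = df.groupby("cluster").agg(
    mean_duration=("duration_sec", "mean"),
    median_pages=("pages_viewed", "median"),
    top_visitor_type=("visitor_type", lambda s: s.mode().iloc[0]),
)
print(profile)
```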