I have a customer data set with about 20-25 attributes about the customer such as:
I have cleaned my dataset so that it has no null values, and I have one-hot encoded the categorical variables into a pandas DataFrame, my_df. I used scikit-learn's KMeans to create 2 clusters on this dataset, and I would like to understand how to tell which customers were assigned to which cluster.
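from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize the features so each one contributes comparably to the distance
scaler = StandardScaler()
my_df_scaler = scaler.fit_transform(my_df)
kmeans = KMeans(n_clusters=2)
model = kmeans.fit(my_df_scaler)
# One cluster label (0 or 1) per row, in the same order as the rows of my_df
preds = model.predict(my_df_scaler)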
Basically, I am looking for some help in getting insights like:
Thanks in advance!
Once you have a cluster label for each customer, you can compute the average of each feature per cluster, and that gives you your answer. More generally, you can look at the distribution of each feature within each cluster and compare the distributions between clusters.
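As a rough illustration of the per-cluster averages, here is a minimal sketch. It assumes the labels returned by KMeans are attached back to the unscaled DataFrame so the averages stay in the original units. The toy my_df and its columns (age, annual_spend) are made up for the example and are not taken from the question.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy stand-in for my_df; column names and values are hypothetical.
my_df = pd.DataFrame({
    "age": [22, 25, 27, 61, 64, 67],
    "annual_spend": [200, 250, 220, 900, 950, 1000],
})

# Same pipeline as in the question: scale, then fit KMeans with 2 clusters.
scaled = StandardScaler().fit_transform(my_df)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# labels_ follows the row order of the input, so it can be attached
# directly to the original (unscaled) DataFrame.
clustered = my_df.assign(cluster=kmeans.labels_)

# Which customers are in which cluster, and how the clusters differ.
cluster_sizes = clustered["cluster"].value_counts()
cluster_means = clustered.groupby("cluster").mean()

print(cluster_sizes)
print(cluster_means)
```

Replacing .mean() with .describe() on the same groupby gives the fuller per-cluster distribution of each feature.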
However, looking at your features, you should not keep both Gender_M and Gender_F, because they are perfectly correlated (Gender_M = 1 - Gender_F). Keeping both means gender is counted twice in the distance computation.
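A minimal sketch of dropping the redundant dummy column is shown here. It assumes the one-hot encoding was done with pandas.get_dummies on a raw column named Gender; that column name and the toy data are assumptions, since the question does not show the encoding step.

```python
import pandas as pd

# Hypothetical raw data before encoding.
raw = pd.DataFrame({"Gender": ["M", "F", "F", "M"], "age": [30, 41, 25, 52]})

# Full one-hot encoding yields both Gender_F and Gender_M.
full = pd.get_dummies(raw, columns=["Gender"])
# The two columns always sum to 1, so one of them carries no extra information.
redundant = ((full["Gender_F"].astype(int) + full["Gender_M"].astype(int)) == 1).all()

# drop_first=True keeps k-1 dummy columns per categorical variable.
reduced = pd.get_dummies(raw, columns=["Gender"], drop_first=True)

# If the DataFrame is already encoded, dropping one column has the same effect.
reduced_alt = full.drop(columns=["Gender_F"])

print(redundant)
print(reduced.columns.tolist())
```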
I also see loyalty status split into new, intermediate and advanced. If these categories were derived from a continuous variable, you should keep the continuous variable rather than three related dummy variables.
Anyway, here are some links that should help with your clustering:
- RFM clustering principles: https://towardsdatascience.com/apply-rfm-principles-to-cluster-customers-with-k-means-fef9bcc9ab16
- A deeper look at KMeans: https://towardsdatascience.com/k-means-clustering-8e1e64c1561c