python pandas scikit-learn cluster-analysis k-means

Understand customer attributes from kmeans clustering

I have a customer data set with about 20-25 attributes about the customer such as:

age
gender_F
gender_M
num_purchases
loyalty_status_new
loyalty_status_intermediate
loyalty_status_advanced
...

I have cleaned the dataset so that it has no null values and one-hot encoded the categorical variables, giving a pandas dataframe my_df. I have used scikit-learn's KMeans to create 2 clusters on this dataset, but I would like to understand how to tell which customers ended up in which cluster.

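    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # standardize the features, then fit k-means with 2 clusters
    scaler = StandardScaler()
    my_df_scaler = scaler.fit_transform(my_df)
    kmeans = KMeans(2)
    model = kmeans.fit(my_df_scaler)
    # preds[i] is the cluster label (0 or 1) for row i of my_df
    preds = model.predict(my_df_scaler)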

Basically, I am looking for some help in getting insights like:

Cluster 1 represents people with larger values for age and loyalty_status_new

Thanks in advance!

Solution

If you have the cluster label for each customer, you can compute the average of each feature per cluster, and that gives you your answer. More generally, you can look at the distribution of each feature within each cluster and compare those distributions between clusters.
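As a rough sketch of the per-cluster averages described above: the snippet below builds a small synthetic stand-in for my_df (the values and the choice of two columns are made up for illustration), attaches the k-means labels as a hypothetical `cluster` column, and groups by it. The names `profiled`, `cluster_means` and `relative` are illustrative, not from the question.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for my_df (assumption): two well-separated customer groups
rng = np.random.default_rng(0)
my_df = pd.DataFrame({
    "age": np.concatenate([rng.normal(25, 3, 50), rng.normal(60, 3, 50)]),
    "num_purchases": np.concatenate([rng.normal(5, 1, 50), rng.normal(20, 2, 50)]),
})

# Same pipeline as in the question: scale, then cluster
scaled = StandardScaler().fit_transform(my_df)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

# Attach the label to the unscaled data so the averages stay in original units
profiled = my_df.assign(cluster=labels)
cluster_means = profiled.groupby("cluster").mean()
cluster_sizes = profiled["cluster"].value_counts()

# How far each cluster mean sits from the overall mean, in standard deviations
relative = (cluster_means - my_df.mean()) / my_df.std()

print(cluster_sizes)
print(cluster_means)
print(relative)
```

Features with a large positive or negative value in `relative` are the ones that distinguish a cluster most, which is one way to arrive at statements like "this cluster has larger values for age". For the full distributions, `profiled.groupby("cluster").describe()` gives a per-cluster summary.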

That said, looking at your features, you should not keep both gender_M and gender_F, since they are perfectly correlated (gender_M = 1 - gender_F). Drop one of them.
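A minimal sketch of that redundancy, using a made-up four-row frame (the `customers` data and the raw `gender` column are assumptions; your my_df is already encoded, so only the `drop` step would apply to it):

```python
import pandas as pd

# Toy data (assumption) with a raw categorical gender column
customers = pd.DataFrame({"age": [23, 45, 31, 52], "gender": ["F", "M", "M", "F"]})

# One-hot encoding yields gender_F and gender_M
encoded = pd.get_dummies(customers, columns=["gender"], dtype=int)

# The two indicator columns carry the same information
corr = encoded["gender_F"].corr(encoded["gender_M"])
print(corr)

# Keep only one of them before scaling and clustering
reduced = encoded.drop(columns=["gender_M"])
print(reduced.columns.tolist())
```

When encoding from the raw column, `pd.get_dummies(..., drop_first=True)` is another way to avoid creating the redundant column in the first place.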

I also see loyalty_status_new, loyalty_status_intermediate and loyalty_status_advanced. If these were derived from a continuous variable, you should keep the continuous variable rather than three related indicator columns like these.

Anyway, here are some links that should help with your clustering:

- RFM clustering principles: https://towardsdatascience.com/apply-rfm-principles-to-cluster-customers-with-k-means-fef9bcc9ab16
- A deeper look at k-means: https://towardsdatascience.com/k-means-clustering-8e1e64c1561c