I have 4 clusters and need to find the most influential features in each one, so that I can gain insight into each cluster's characteristics and understand its behavior. How can I do this?
A rudimentary way to address the problem is to compute descriptive statistics of the cluster centroids, feature by feature. A feature whose centroid values vary widely across clusters is one that helps tell the clusters apart.
Snippet to find the most influential variables:
# cc is a DataFrame of cluster centroids: one row per cluster, one column per feature.
var_influence = cc.describe()
# var_influence now holds the descriptive statistics of the centroids (one column per feature).
# Sort the columns by the 'std' row, descending, and keep the 10 features with the highest standard deviation.
var_influence.sort_values(axis=1, by='std', ascending=False).iloc[:, :10]
This is quicker than inspecting box plots, which become hard to read as the number of features grows. Because all the variables are normalised, the values are directly comparable across features.
A max-min (range) approach can also be used; it shows the variables whose centroid values span the widest range. Because all the variables are normalised, the range is a good way to validate the result above. The code is below:
(var_influence.loc['max'] - var_influence.loc['min']).sort_values(ascending=False).head(10)
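For anyone who wants to try the two rankings end to end, here is a minimal sketch on made-up data. The names `X`, `labels` and `X_norm` are hypothetical, and building `cc` as the per-cluster mean of z-scored features is an assumption about how the centroids were obtained, not something stated above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 200 rows, 5 features, 4 cluster labels (50 rows each).
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 50)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
# Shift f2 by cluster so that it clearly separates the clusters.
X["f2"] += labels * 5.0

# Assumption: features are normalised with a z-score so spreads are comparable.
X_norm = (X - X.mean()) / X.std()

# cc: one row per cluster, one column per feature (the cluster centroids).
cc = X_norm.groupby(labels).mean()

var_influence = cc.describe()

# Ranking 1: standard deviation of the centroid values, per feature.
by_std = var_influence.loc["std"].sort_values(ascending=False)
# Ranking 2: max-min range of the centroid values, per feature.
by_range = (var_influence.loc["max"] - var_influence.loc["min"]).sort_values(ascending=False)

print(by_std.head(10))
print(by_range.head(10))
```

On this toy data both rankings put `f2` first, since it is the only feature that was shifted between clusters.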
Multiclass classification
A more rigorous way to find the influential features is multi-class classification: use the cluster labels as the target variable and train a multi-class classification model on the data. The resulting model coefficients can then be used to judge the importance of each feature in separating the clusters.
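A rough sketch of that idea, assuming scikit-learn is available and using logistic regression as the classifier (my choice for illustration; any multi-class model that exposes coefficients would do). The data and the names `X`, `labels`, `X_norm`, `coefs` and `overall` are hypothetical. The coefficients are only comparable across features because the inputs are normalised first.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 200 rows, 5 features, 4 cluster labels (50 rows each).
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 50)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
# Shift f2 by cluster so that it clearly separates the clusters.
X["f2"] += labels * 5.0

# Normalise so that coefficient magnitudes are comparable across features.
X_norm = (X - X.mean()) / X.std()

# Cluster labels are the target; the features are the predictors.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_norm, labels)

# coef_ has one row per cluster and one column per feature.
coefs = pd.DataFrame(clf.coef_, index=clf.classes_, columns=X_norm.columns)

# Per-cluster view: largest absolute coefficients in each row.
print(coefs.abs())

# Overall view: mean absolute coefficient per feature, largest first.
overall = coefs.abs().mean().sort_values(ascending=False)
print(overall)
```

Reading the rows of `coefs` gives a per-cluster picture (which features push a point towards or away from that cluster), while `overall` gives a single ranking. It is worth checking that the classifier actually reproduces the cluster labels well before trusting its coefficients.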