python machine-learning scikit-learn k-means pca

How to assign importance coefficients to features before KMeans?


Let's say I have the following dataframe:

   feature_1  feature_2  feature_3  feature_4  feature_5  feature_6  feature_7  feature_8
0   0.862874   0.392938   0.669744   0.939903   0.382574   0.780595   0.049201   0.627703
1   0.942322   0.676181   0.223476   0.102698   0.620883   0.834038   0.966355   0.554645
2   0.940375   0.310532   0.975096   0.600778   0.893220   0.282508   0.837575   0.112575
3   0.868902   0.818175   0.102860   0.936395   0.406088   0.619990   0.913905   0.597607
4   0.143344   0.207751   0.835707   0.414900   0.360534   0.525631   0.228751   0.294437
5   0.339856   0.501197   0.671033   0.302202   0.406512   0.997044   0.080621   0.068071
6   0.521056   0.343654   0.812553   0.393159   0.217987   0.247602   0.671783   0.254299
7   0.594744   0.180041   0.884603   0.578050   0.441461   0.176732   0.569595   0.391923
8   0.402864   0.062175   0.565858   0.349415   0.106725   0.323310   0.153594   0.277930
9   0.480539   0.540283   0.248376   0.252237   0.229181   0.092273   0.546501   0.201396

I would like to find clusters in these rows using KMeans. However, I want the clustering to give more importance to [feature_1, feature_2] than to the other features in the dataframe: let's say an importance coefficient of 0.5 for [feature_1, feature_2] together, and 0.5 for the remaining features together.

I thought about transforming [feature_3, ..., feature_8] into a single column using PCA. By doing so, I imagine that KMeans would give less importance to that single feature than to 6 separate features.

Is this a good idea? Do you see better ways of giving this information to the algorithm?


Solution

  • KMeans tries to find centroids and assigns each point to the centroid with the smallest Euclidean distance. When minimizing Euclidean distances, or using them as loss functions in machine learning, one should in general make sure that the different features are on the same scale. Otherwise features with larger value ranges would dominate when finding the closest points. That's why we normally do some scaling before training our models.

    However, in your case, you can take advantage of this: first bring all features onto the same scale using a MinMaxScaler or StandardScaler, and after that either scale up the first 2 features by a factor > 1 or scale down the remaining 6 features by a factor < 1. Note that multiplying a feature by a factor w multiplies its contribution to the squared Euclidean distance by w².
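
    A minimal sketch of this idea, under a few assumptions that are not in the question: the data is a random stand-in with the same shape as the dataframe above, the number of clusters (3) is arbitrary, and the "0.5 / 0.5" importance is interpreted as each feature group contributing half of the total variance after standardization. Under that interpretation, each feature in a group of n features with share s is multiplied by sqrt(s / n), because a factor w changes the squared-distance contribution by w².

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random stand-in data with the same shape as the question's dataframe (assumption).
rng = np.random.default_rng(0)
columns = [f"feature_{i}" for i in range(1, 9)]
df = pd.DataFrame(rng.random((10, 8)), columns=columns)

important = ["feature_1", "feature_2"]
others = [c for c in columns if c not in important]

# Step 1: bring every feature onto the same scale (zero mean, unit variance).
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df), columns=columns, index=df.index
)

# Step 2: weight the features. Assumed interpretation of "0.5 / 0.5":
# each group should account for half of the total variance, so a feature in a
# group of n features with share s is multiplied by sqrt(s / n).
important_share = 0.5
weights = pd.Series(np.sqrt((1 - important_share) / len(others)), index=columns)
weights[important] = np.sqrt(important_share / len(important))
weighted = scaled * weights  # aligned on column names

# Step 3: cluster the weighted features (n_clusters=3 is an arbitrary choice).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(weighted)

print(weights.round(3).to_dict())
print(labels)
```

    With these weights the two important features each get a factor of 0.5 and the other six each get about 0.289, so every individual important feature counts for more than any single remaining feature, while the two groups carry equal total weight. Raising `important_share` above 0.5 shifts the clustering further toward feature_1 and feature_2.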