machine-learning, data-science, feature-engineering

Fit clustering outputs into a machine learning model


Just a machine learning/data science problem.

a) Let's say I have a dataset of 20 features, and I decide to use 3 of them for unsupervised clustering, which ideally produces 3 clusters (A, B and C).

b) Then I add the resulting cluster assignment (A, B or C) back into my dataset as a new feature (i.e. 21 features in total).

c) I run a regression model on the 21 features to predict a label value.
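
In code, I mean something like this (the data and the RandomForestRegressor are purely hypothetical placeholders, just to make the three steps concrete):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestRegressor
    
    # hypothetical dataset: 500 samples, 20 features, one continuous label
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))
    y = rng.normal(size=500)
    
    # a) cluster on 3 of the 20 features (columns 0-2 chosen arbitrarily here)
    kmeans = KMeans(n_clusters=3, random_state=0).fit(X[:, :3])
    
    # b) add the cluster assignment back as a 21st feature
    X_21 = np.column_stack([X, kmeans.labels_])
    
    # c) fit a regression model on all 21 features
    reg = RandomForestRegressor(random_state=0).fit(X_21, y)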

I wonder whether step b) is redundant (since the information is already present in the original features), especially if I use a more powerful model such as Random Forest or XGBoost, and how to explain this mathematically.

Any opinions and suggestions will be great!


Solution

  • Great idea: just give it a try and see how it goes. As you guessed, this is highly dependent on your dataset and your choice of model; like any other feature engineering, it is hard to predict in advance how adding this kind of feature will behave. Be careful though: in some cases it does not even improve performance. See the test below on the Iris dataset, where performance actually decreases:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import load_iris
    from sklearn.svm import SVC
    from sklearn import metrics
    
    # load data
    iris = load_iris()
    X = iris.data[:, :3]  # only keep three out of the four available features to make it more challenging
    y = iris.target
    
    # split train / test
    indices = np.random.permutation(len(X))
    N_test = 30
    X_train, y_train = X[indices[:-N_test]], y[indices[:-N_test]]
    X_test, y_test = X[indices[-N_test:]], y[indices[-N_test:]]
    
    # compute a clustering method (here KMeans) based on available features in X_train
    kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)
    new_clustering_feature_train = kmeans.predict(X_train)
    new_clustering_feature_test = kmeans.predict(X_test)
    
    # create a new input train/test X with this feature added
    X_train_with_clustering_feature = np.column_stack([X_train, new_clustering_feature_train])
    X_test_with_clustering_feature = np.column_stack([X_test, new_clustering_feature_test])
    

    Now let's compare two models, one trained only on X_train and the other on X_train_with_clustering_feature:

    # baseline: model trained on the original three features only
    model1 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train, y_train)
    print(metrics.classification_report(y_test, model1.predict(X_test)))
    
                  precision    recall  f1-score   support
    
               0       1.00      1.00      1.00        45
               1       0.95      0.97      0.96        38
               2       0.97      0.95      0.96        37
    
        accuracy                           0.97       120
       macro avg       0.97      0.97      0.97       120
    weighted avg       0.98      0.97      0.97       120
    

    And the other model:

    # model trained with the cluster assignment added as an extra feature
    model2 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train_with_clustering_feature, y_train)
    print(metrics.classification_report(y_test, model2.predict(X_test_with_clustering_feature)))
    
                  precision    recall  f1-score   support

               0       1.00      1.00      1.00        45
               1       0.87      0.97      0.92        35
               2       0.97      0.88      0.92        40
    
        accuracy                           0.95       120
       macro avg       0.95      0.95      0.95       120
    weighted avg       0.95      0.95      0.95       120
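
    Since the question also mentions Random Forest and XGBoost, the same comparison can be repeated with a tree-based model. This is only a sketch reusing the variables defined above; whether the extra feature helps will again depend on the data and on the random split:

    from sklearn.ensemble import RandomForestClassifier

    # same comparison as above, but with a tree-based model instead of an SVC
    rf_plain = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    rf_clust = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train_with_clustering_feature, y_train)

    print("accuracy without clustering feature:", metrics.accuracy_score(y_test, rf_plain.predict(X_test)))
    print("accuracy with clustering feature:   ", metrics.accuracy_score(y_test, rf_clust.predict(X_test_with_clustering_feature)))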