machine-learning, data-science, feature-engineering

Fit clustering outputs into a machine learning model


Just a machine learning/data science problem.

a) Let's say I have a dataset of 20 features, and I decide to use 3 of them for unsupervised clustering, which ideally produces 3 clusters (A, B and C).

b) Then I add the resulting cluster assignment (A, B or C) back into my dataset as a new feature (i.e. 21 features in total).

c) I run a regression model on the 21 features to predict a label value.
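
In code, I mean something like this (the data and the RandomForestRegressor are purely hypothetical placeholders, just to make the three steps concrete):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestRegressor
    
    # hypothetical dataset: 500 samples, 20 features, one continuous label
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))
    y = rng.normal(size=500)
    
    # a) cluster on 3 of the 20 features (columns 0-2 chosen arbitrarily here)
    kmeans = KMeans(n_clusters=3, random_state=0).fit(X[:, :3])
    
    # b) add the cluster assignment back as a 21st feature
    X_21 = np.column_stack([X, kmeans.labels_])
    
    # c) fit a regression model on all 21 features
    reg = RandomForestRegressor(random_state=0).fit(X_21, y)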

I wonder whether step b) is redundant (since the information is already present in the original features), especially if I use a more powerful model such as Random Forest or XGBoost, and how to explain this mathematically.

Any opinions and suggestions will be great!


Solution

  • Great idea: just give it a try and see how it goes. As you guessed, this is highly dependent on your dataset and your choice of model; like any other feature engineering, it is hard to predict in advance how adding this kind of feature will behave. Be careful though: in some cases it does not even improve performance. See the test below on the Iris dataset, where performance actually decreases:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import load_iris
    from sklearn.svm import SVC
    from sklearn import metrics
    
    # load data
    iris = load_iris()
    X = iris.data[:, :3]  # only keep three out of the four available features to make it more challenging
    y = iris.target
    
    # split train / test
    indices = np.random.permutation(len(X))
    N_test = 30
    X_train, y_train = X[indices[:-N_test]], y[indices[:-N_test]]
    X_test, y_test = X[indices[-N_test:]], y[indices[-N_test:]]
    
    # compute a clustering method (here KMeans) based on available features in X_train
    kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)
    new_clustering_feature_train = kmeans.predict(X_train)
    new_clustering_feature_test = kmeans.predict(X_test)
    
    # create a new input train/test X with this feature added
    X_train_with_clustering_feature = np.column_stack([X_train, new_clustering_feature_train])
    X_test_with_clustering_feature = np.column_stack([X_test, new_clustering_feature_test])
    

    Now let's compare two models, one trained only on X_train and the other on X_train_with_clustering_feature:

    # baseline: model trained on the original three features only
    model1 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train, y_train)
    print(metrics.classification_report(y_test, model1.predict(X_test)))
    
                  precision    recall  f1-score   support
    
               0       1.00      1.00      1.00        45
               1       0.95      0.97      0.96        38
               2       0.97      0.95      0.96        37
    
        accuracy                           0.97       120
       macro avg       0.97      0.97      0.97       120
    weighted avg       0.98      0.97      0.97       120
    

    And the other model:

    # model trained with the cluster assignment added as an extra feature
    model2 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train_with_clustering_feature, y_train)
    print(metrics.classification_report(y_test, model2.predict(X_test_with_clustering_feature)))
    
                  precision    recall  f1-score   support

               0       1.00      1.00      1.00        45
               1       0.87      0.97      0.92        35
               2       0.97      0.88      0.92        40
    
        accuracy                           0.95       120
       macro avg       0.95      0.95      0.95       120
    weighted avg       0.95      0.95      0.95       120
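
    Since the question also mentions Random Forest and XGBoost, the same comparison can be repeated with a tree-based model. This is only a sketch reusing the variables defined above; whether the extra feature helps will again depend on the data and on the random split:

    from sklearn.ensemble import RandomForestClassifier

    # same comparison as above, but with a tree-based model instead of an SVC
    rf_plain = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    rf_clust = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train_with_clustering_feature, y_train)

    print("accuracy without clustering feature:", metrics.accuracy_score(y_test, rf_plain.predict(X_test)))
    print("accuracy with clustering feature:   ", metrics.accuracy_score(y_test, rf_clust.predict(X_test_with_clustering_feature)))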