python, machine-learning, scikit-learn, classification, supervised-learning

Stacking ensemble of classifiers in a chain


I have the following human activity recognition sample dataset:

import pandas as pd

df = pd.DataFrame(
    {
        'mean_speed': [40.01, 3.1, 2.88, 20.89, 5.82, 40.01, 33.1, 40.88, 20.89, 5.82, 40.018, 23.1],
        'max_speed': [70.11, 6.71, 7.08, 39.63, 6.68, 70.11, 65.71, 71.08, 39.63, 13.68, 70.11, 35.71],
        'max_acc': [17.63, 2.93, 3.32, 15.57, 0.94, 17.63, 12.93, 3.32, 15.57, 0.94, 17.63, 12.93],
        'mean_acc': [5.15, 1.97, 0.59, 5.11, 0.19, 5.15, 2.97, 0.59, 5.11, 0.19, 5.15, 2.97],
        'activity': ['driving', 'walking', 'walking', 'riding', 'walking', 'driving', 'motor-bike',
                     'motor-bike', 'riding', 'riding', 'motor-bike', 'riding']
    }
)
df.head()
  mean_speed max_speed  max_acc mean_acc    activity
0   40.01     70.11     17.63    5.15       driving
1    3.10      6.71      2.93    1.97       walking
2    2.88      7.08      3.32    0.59       walking
3   20.89     39.63     15.57    5.11       riding
4    5.82      6.68      0.94    0.19       walking

I want to create a chain of machine learning classifiers in a pipeline, where the base classifier first predicts whether an activity is motorised (driving, motor-bike) or non-motorised (riding, walking). The learning phase should proceed like so: [image: learning-phase diagram]

So I add a column, type, stating whether an activity is motorised or not.

class_mapping = {'driving':'motorised', 'motor-bike':'motorised', 'walking':'non-motorised', 'riding':'non-motorised'}
df['type'] = df['activity'].map(class_mapping)

df.head()
 mean_speed max_speed   max_acc mean_acc    activity        type
0   40.01    70.11       17.63    5.15      driving       motorised
1    3.10     6.71        2.93    1.97      walking   non-motorised
2    2.88     7.08        3.32    0.59      walking   non-motorised
3   20.89    39.63       15.57    5.11       riding   non-motorised
4    5.82     6.68        0.94    0.19      walking   non-motorised

Question:

I would like to train a Random Forest as the base classifier to predict whether an activity is motorised or non-motorised, with a probability output. Two meta-classifiers then follow: a Decision Tree that predicts whether the activity is walking or riding, and an SVC that predicts whether it is driving or motor-bike. The meta-classifiers (DT, SVC) would take as input the 4 features plus the probability output of the first classifier. Naturally, the DT and the SVC would each be trained only on the subset of the dataset corresponding to the classes they predict.

I have this idea of the learning procedure, but I am not sure how to implement it.

Can anyone out there show how this could be done?


Solution

  • Core Scikit-Learn classes do not support this kind of conditional chaining out of the box. But if you're open to using some 3rd party packages (here sklego and sklearn2pmml), then there's nothing difficult about it.

    First, enrich your dataset with a feature that corresponds to the probability of the motorised/non-motorised decision:

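    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import FeatureUnion
    from sklego.preprocessing import IdentityTransformer
    from sklego.meta import EstimatorTransformer
    
    # Pass the original features through and append the base classifier's probabilities
    feature_estimator = FeatureUnion([
      ("identity", IdentityTransformer()),
      ("type_estimator", EstimatorTransformer(RandomForestClassifier(), predict_func = "predict_proba"))
    ])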
    

    This will append two columns (corresponding to the output of RandomForestClassifier.predict_proba(X)) to your original data matrix.
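To make the column layout concrete, here is a minimal sketch using plain scikit-learn and NumPy rather than the FeatureUnion above. It assumes the Random Forest is fitted on the coarse `type` labels, uses the first six rows of the question's sample data, and stacks the columns by hand with `np.hstack`, which only mimics what the feature union is expected to produce. The names `y_type`, `proba` and `augmented` are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Four feature columns from the question's sample data (first six rows).
X = np.array([
    [40.01, 70.11, 17.63, 5.15],
    [3.10, 6.71, 2.93, 1.97],
    [2.88, 7.08, 3.32, 0.59],
    [20.89, 39.63, 15.57, 5.11],
    [5.82, 6.68, 0.94, 0.19],
    [40.01, 70.11, 17.63, 5.15],
])
# Coarse labels for those rows, following the question's class_mapping.
y_type = np.array(["motorised", "non-motorised", "non-motorised",
                   "non-motorised", "non-motorised", "motorised"])

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y_type)

# classes_ is sorted, so column 0 of predict_proba is P(motorised)
# and column 1 is P(non-motorised).
proba = rf.predict_proba(X)

# Hand-rolled equivalent of "original features + probability columns".
augmented = np.hstack([X, proba])
print(rf.classes_)
print(augmented.shape)
```

Because the class labels are sorted alphabetically, the "motorised" probability ends up in the second-to-last column of the augmented matrix.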

    Second, set up the final classifier ensemble. Use the probability of the "motorised" class to make the routing decision (e.g. by applying a 75% threshold). Please note that this "flag" feature is the second-to-last column, i.e. X[-2].

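    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn2pmml.ensemble import EstimatorChain
    
    # Each step is (name, estimator, predicate); the predicate selects the rows the estimator handles
    final_estimator_chain = EstimatorChain([
      ("motorized", SVC(), "X[-2] >= 0.75"),
      ("non-motorized", DecisionTreeClassifier(), "X[-2] < 0.75"),
    ], multioutput = False)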
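As a rough illustration of what the two predicates select, the sketch below applies the same 75% threshold with NumPy boolean masks. It assumes the predicates are evaluated row by row against the augmented feature matrix; the matrix `X_aug` and its probability values are made up for the example, not produced by a fitted model.

```python
import numpy as np

# Hypothetical augmented rows: 4 features + P(motorised) + P(non-motorised).
X_aug = np.array([
    [40.01, 70.11, 17.63, 5.15, 0.90, 0.10],
    [3.10, 6.71, 2.93, 1.97, 0.05, 0.95],
    [20.89, 39.63, 15.57, 5.11, 0.60, 0.40],
    [33.10, 65.71, 12.93, 2.97, 0.75, 0.25],
])

threshold = 0.75
# Rows whose second-to-last column clears the threshold go to the SVC branch.
motorised_mask = X_aug[:, -2] >= threshold
# All remaining rows go to the decision tree branch.
non_motorised_mask = ~motorised_mask

print(motorised_mask.tolist())      # [True, False, False, True]
print(non_motorised_mask.tolist())  # [False, True, True, False]
```

Note that the third row, with a 60% "motorised" probability, is still routed to the non-motorised branch: a threshold above 0.5 deliberately biases ambiguous rows towards one side.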
    

    Putting everything together:

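    from sklearn.pipeline import Pipeline

    # X holds the four feature columns and y the target labels.
    # Note that Pipeline.fit passes the same y to every step.
    pipeline = Pipeline([
      ("step_one", feature_estimator),
      ("step_two", final_estimator_chain)
    ])
    pipeline.fit(X, y)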
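If extra packages are not an option, or if the two stages need different targets (`type` for the base classifier, `activity` for the specialists), the same routing can be written by hand. The following is only a sketch under those assumptions: it is not a single Pipeline object, the helper names `augment` and `predict_activity` are made up for illustration, and the 0.75 threshold is carried over from above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "mean_speed": [40.01, 3.1, 2.88, 20.89, 5.82, 40.01, 33.1, 40.88, 20.89, 5.82, 40.018, 23.1],
    "max_speed": [70.11, 6.71, 7.08, 39.63, 6.68, 70.11, 65.71, 71.08, 39.63, 13.68, 70.11, 35.71],
    "max_acc": [17.63, 2.93, 3.32, 15.57, 0.94, 17.63, 12.93, 3.32, 15.57, 0.94, 17.63, 12.93],
    "mean_acc": [5.15, 1.97, 0.59, 5.11, 0.19, 5.15, 2.97, 0.59, 5.11, 0.19, 5.15, 2.97],
    "activity": ["driving", "walking", "walking", "riding", "walking", "driving", "motor-bike",
                 "motor-bike", "riding", "riding", "motor-bike", "riding"],
})
class_mapping = {"driving": "motorised", "motor-bike": "motorised",
                 "walking": "non-motorised", "riding": "non-motorised"}
df["type"] = df["activity"].map(class_mapping)

features = ["mean_speed", "max_speed", "max_acc", "mean_acc"]
X = df[features].to_numpy()
y_type = df["type"].to_numpy()
y_activity = df["activity"].to_numpy()

# Stage 1: coarse classifier trained on the motorised / non-motorised labels.
base = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y_type)

def augment(X):
    """Append the stage-1 class probabilities to the original features."""
    return np.hstack([X, base.predict_proba(X)])

X_aug = augment(X)

# Stage 2: each specialist is trained only on the rows of its own group.
is_motorised = y_type == "motorised"
svc = SVC().fit(X_aug[is_motorised], y_activity[is_motorised])
tree = DecisionTreeClassifier(random_state=0).fit(X_aug[~is_motorised], y_activity[~is_motorised])

def predict_activity(X_new, threshold=0.75):
    """Route each row by P(motorised), then ask the matching specialist."""
    X_new_aug = augment(X_new)
    route = X_new_aug[:, -2] >= threshold  # column -2 is P(motorised)
    out = np.empty(len(X_new_aug), dtype=object)
    if route.any():
        out[route] = svc.predict(X_new_aug[route])
    if (~route).any():
        out[~route] = tree.predict(X_new_aug[~route])
    return out

pred = predict_activity(X)
print(pred)
```

Here the stage-2 models see probabilities that the Random Forest produced on its own training rows, which are likely to be over-confident. Out-of-fold probabilities, for example from `sklearn.model_selection.cross_val_predict(..., method="predict_proba")`, would be a more careful choice on a real dataset.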