I have the following human activity recognition sample dataset:
import pandas as pd

df = pd.DataFrame(
    {
        'mean_speed': [40.01, 3.1, 2.88, 20.89, 5.82, 40.01, 33.1, 40.88, 20.89, 5.82, 40.018, 23.1],
        'max_speed': [70.11, 6.71, 7.08, 39.63, 6.68, 70.11, 65.71, 71.08, 39.63, 13.68, 70.11, 35.71],
        'max_acc': [17.63, 2.93, 3.32, 15.57, 0.94, 17.63, 12.93, 3.32, 15.57, 0.94, 17.63, 12.93],
        'mean_acc': [5.15, 1.97, 0.59, 5.11, 0.19, 5.15, 2.97, 0.59, 5.11, 0.19, 5.15, 2.97],
        'activity': ['driving', 'walking', 'walking', 'riding', 'walking', 'driving', 'motor-bike',
                     'motor-bike', 'riding', 'riding', 'motor-bike', 'riding']
    }
)
df.head()
mean_speed max_speed max_acc mean_acc activity
0 40.01 70.11 17.63 5.15 driving
1 3.10 6.71 2.93 1.97 walking
2 2.88 7.08 3.32 0.59 walking
3 20.89 39.63 15.57 5.11 riding
4 5.82 6.68 0.94 0.19 walking
I want to create a chain of machine learning classifiers in a pipeline, where the base classifier first predicts whether an activity is motorised (driving, motor-bike) or non-motorised (riding, walking). The learning phase should proceed like so:
So I add a column named type, stating whether an activity is motorised or not.
class_mapping = {'driving':'motorised', 'motor-bike':'motorised', 'walking':'non-motorised', 'riding':'non-motorised'}
df['type'] = df['activity'].map(class_mapping)
df.head()
mean_speed max_speed max_acc mean_acc activity type
0 40.01 70.11 17.63 5.15 driving motorised
1 3.10 6.71 2.93 1.97 walking non-motorised
2 2.88 7.08 3.32 0.59 walking non-motorised
3 20.89 39.63 15.57 5.11 riding non-motorised
4 5.82 6.68 0.94 0.19 walking non-motorised
Question:
I would like to train a Random Forest as the base classifier to predict whether an activity is motorised or non-motorised, with a probability output. Two meta-classifiers then follow: a Decision Tree to predict whether the activity is walking or riding, and an SVC to predict whether it is driving or motor-bike. The meta-classifiers (DT, SVC) would take as input the 4 features plus the probability output of the first classifier. Obviously, DT and SVC would each be trained only on the subset of the dataset corresponding to the classes they predict.
I have this idea of the learning procedure, but I am not sure how to implement it.
Can anyone show how this could be done?
What you're asking is not possible with core Scikit-Learn classes alone. But if you're open to using third-party packages (here scikit-lego and sklearn2pmml), it is not difficult at all.
First, enrich your dataset with a feature that corresponds to the probability of the motorised/non-motorised decision:
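from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion
from sklego.meta import EstimatorTransformer
from sklego.preprocessing import IdentityTransformer

# Pass the original features through unchanged and append the
# class probabilities predicted by the random forest.
feature_estimator = FeatureUnion([
    ("identity", IdentityTransformer()),
    ("type_estimator", EstimatorTransformer(RandomForestClassifier(), predict_func = "predict_proba"))
])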
This will append two columns (corresponding to the output of RandomForestClassifier.predict_proba(X)) to your original data matrix.
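To make the shape of the enriched matrix concrete, here is a minimal sketch using only NumPy and scikit-learn (it does not use the FeatureUnion above). It assumes the forest is fitted on the binary type labels; the four rows are taken from the question's dataset, and the names y_type and X_enriched are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Four rows from the question's dataset (two motorised, two non-motorised).
X = np.array([
    [40.01, 70.11, 17.63, 5.15],   # driving
    [3.10, 6.71, 2.93, 1.97],      # walking
    [2.88, 7.08, 3.32, 0.59],      # walking
    [33.10, 65.71, 12.93, 2.97],   # motor-bike
])
y_type = np.array(["motorised", "non-motorised", "non-motorised", "motorised"])

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_type)

# predict_proba returns one column per class, ordered as in rf.classes_.
proba = rf.predict_proba(X)
X_enriched = np.hstack([X, proba])

print(rf.classes_)        # classes are sorted: motorised first, non-motorised second
print(X_enriched.shape)   # 4 original features + 2 probability columns
```

Because the classes are sorted alphabetically, the second-to-last column holds the probability of "motorised" and the last column the probability of "non-motorised".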
Second, set up the final classifier ensemble. Use the probability of the "motorised" class to make the decision (e.g. by applying a 75% threshold). Note that the index of this "flag" feature is second-to-last, i.e. X[-2].
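The routing rule can be illustrated with plain NumPy. This is only a sketch of the idea, not how any library implements it: the probability values below are made up, and on a whole 2-D matrix the per-row expression X[-2] corresponds to the column X[:, -2].

```python
import numpy as np

# Hypothetical enriched matrix: 4 features + [P(motorised), P(non-motorised)].
X_enriched = np.array([
    [40.01, 70.11, 17.63, 5.15, 0.9, 0.1],
    [3.10, 6.71, 2.93, 1.97, 0.1, 0.9],
    [20.89, 39.63, 15.57, 5.11, 0.4, 0.6],
    [33.10, 65.71, 12.93, 2.97, 0.8, 0.2],
])

threshold = 0.75

# Rows whose P(motorised) reaches the threshold go to the "motorised" model,
# all remaining rows go to the "non-motorised" model.
motorised_mask = X_enriched[:, -2] >= threshold
X_motorised = X_enriched[motorised_mask]
X_non_motorised = X_enriched[~motorised_mask]

print(motorised_mask)
```

With a 75% threshold, a row with P(motorised) = 0.4 is treated as non-motorised, and so would a row at 0.6; the threshold is a tuning choice rather than a fixed rule.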
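from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml.ensemble import EstimatorChain

# Each step is (name, estimator, predicate); the predicate selects the rows
# that the estimator is responsible for.
final_estimator_chain = EstimatorChain([
    ("motorized", SVC(), "X[-2] >= 0.75"),
    ("non-motorized", DecisionTreeClassifier(), "X[-2] < 0.75"),
], multioutput = False)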
Putting everything together:
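from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("step_one", feature_estimator),
    ("step_two", final_estimator_chain)
])
# X holds the feature columns and y the labels;
# Pipeline passes the same y to every step.
pipeline.fit(X, y)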
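For comparison, the learning procedure described in the question can also be written by hand, outside of a single Pipeline object. The following is a rough sketch under several assumptions, not a replacement for the pipeline above: stage one is trained on the type labels and stage two on the activity labels, only the "motorised" probability is appended, the training subsets are chosen by the true type, and the helper predict_activity with its 0.5 default threshold is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "mean_speed": [40.01, 3.1, 2.88, 20.89, 5.82, 40.01, 33.1, 40.88, 20.89, 5.82, 40.018, 23.1],
    "max_speed": [70.11, 6.71, 7.08, 39.63, 6.68, 70.11, 65.71, 71.08, 39.63, 13.68, 70.11, 35.71],
    "max_acc": [17.63, 2.93, 3.32, 15.57, 0.94, 17.63, 12.93, 3.32, 15.57, 0.94, 17.63, 12.93],
    "mean_acc": [5.15, 1.97, 0.59, 5.11, 0.19, 5.15, 2.97, 0.59, 5.11, 0.19, 5.15, 2.97],
    "activity": ["driving", "walking", "walking", "riding", "walking", "driving", "motor-bike",
                 "motor-bike", "riding", "riding", "motor-bike", "riding"],
})
class_mapping = {"driving": "motorised", "motor-bike": "motorised",
                 "walking": "non-motorised", "riding": "non-motorised"}
df["type"] = df["activity"].map(class_mapping)

features = ["mean_speed", "max_speed", "max_acc", "mean_acc"]
X = df[features].to_numpy()

# Stage 1: base classifier predicts motorised vs non-motorised.
base = RandomForestClassifier(random_state=0).fit(X, df["type"])
motorised_idx = list(base.classes_).index("motorised")

# Append P(motorised) as a fifth feature.
X_aug = np.column_stack([X, base.predict_proba(X)[:, motorised_idx]])

# Stage 2: each meta-classifier sees only the rows of its own group.
is_motorised = (df["type"] == "motorised").to_numpy()
svc = SVC().fit(X_aug[is_motorised], df.loc[is_motorised, "activity"])
dt = DecisionTreeClassifier(random_state=0).fit(
    X_aug[~is_motorised], df.loc[~is_motorised, "activity"])

def predict_activity(X_new, threshold=0.5):
    """Route each row to SVC or DT based on P(motorised) (hypothetical helper)."""
    X_new = np.asarray(X_new, dtype=float)
    p = base.predict_proba(X_new)[:, motorised_idx]
    X_new_aug = np.column_stack([X_new, p])
    route = p >= threshold
    out = np.empty(len(X_new), dtype=object)
    if route.any():
        out[route] = svc.predict(X_new_aug[route])
    if (~route).any():
        out[~route] = dt.predict(X_new_aug[~route])
    return out

preds = predict_activity(X)
print(preds)
```

Note that the probabilities fed to the meta-classifiers here are computed on the same rows the forest was trained on, so they are optimistic; out-of-fold probabilities (for example from cross_val_predict) may be a better choice on real data.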