python scikit-learn data-science pipeline cross-validation

How to create a scikit-learn pipeline that applies z-score and cross-validation?

I am trying to normalize my data within each fold of the cross-validation, and I came across this question.

As suggested, I went to the scikit-learn documentation and found this example:

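from sklearn import preprocessing, svm
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
# The scaler is refit on the training part of every fold
clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
cross_val_score(clf, X, y, cv=cv)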

This looks like exactly what I am trying to achieve. However, I want to use a z-score instead of the StandardScaler, so I tried this:

clf = make_pipeline(stats.zscore(), DecisionTreeClassifier())

But I get an error saying this:

TypeError: zscore() missing 1 required positional argument: 'a'

What should be the argument of zscore()?

Solution

Welcome to Stack Overflow! There are several ways to use custom functionality in sklearn pipelines; I think FunctionTransformer could fit your case.

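The error occurs because stats.zscore() calls the function immediately with no data, whereas make_pipeline expects transformer objects that it can apply to X later. Create a transformer that wraps zscore and pass that transformer to make_pipeline instead of calling zscore directly.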
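Here is a minimal sketch of that idea. The iris dataset, cv=5 and random_state=0 are my own assumptions so that the snippet runs on its own; swap in your X, y and cv.

```python
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier

# Illustrative data only (assumption); replace with your own X and y.
X, y = load_iris(return_X_y=True)

# Pass the function itself, without parentheses.
# The pipeline calls stats.zscore(X) later (column-wise, since axis=0 is the default).
zscorer = FunctionTransformer(stats.zscore)

clf = make_pipeline(zscorer, DecisionTreeClassifier(random_state=0))
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

One thing to keep in mind: a FunctionTransformer built this way is stateless, so zscore standardizes whatever data it receives using that data's own mean and standard deviation. StandardScaler instead learns the mean and standard deviation on the training fold and reuses them on the test fold. With the default settings both compute the same z-score formula on the data they are fitted on, so StandardScaler may already do what you need.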

I hope this helps!