python dataframe machine-learning scikit-learn pipeline

sklearn Pipeline for custom operations

I have been trying to create a pipeline for a basic classification task. However, I can't find out how to implement the following operations using sklearn.Pipeline:

Add some steps just for training data & not test data
Implement a 'df.apply' sort of function

I tried reading some Medium blog posts and the documentation, but in vain.

Solution

There are probably many ways to do this, and strictly speaking you don't have to do it with sklearn.Pipeline. You could use something like Airflow to orchestrate the steps of your classification task, or a tool like ZenML, which is built to handle exactly this kind of task. With ZenML, you wrap each of your steps in a simple @step decorator, then chain them together in a pipeline.

The ZenML quickstart guide has a simple example that might suit your purposes. Otherwise, check out the GitHub page for more details.

Disclaimer: I'm an engineer working at ZenML myself, so this is admittedly biased! Nevertheless, I think it might be useful for you. You can even do things like run your pipelines on an Airflow orchestrator pretty easily.
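
If you would rather stay within scikit-learn, here is a minimal sketch of one way to cover both points from the question. It assumes the 'df.apply' step is stateless, so it can be wrapped in a FunctionTransformer, and that the training-only step (here, dropping outlier rows) can simply run before fit, outside the pipeline, so it never touches test data. The column names a and b, the helper names add_ratio and drop_outliers, and the synthetic data are all hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def add_ratio(df: pd.DataFrame) -> pd.DataFrame:
    """Row-wise feature engineering in the style of df.apply (hypothetical feature)."""
    out = df.copy()
    out["ratio"] = out.apply(lambda row: row["a"] / (row["b"] + 1.0), axis=1)
    return out


def drop_outliers(X: pd.DataFrame, y: pd.Series):
    """Training-only step (hypothetical rule): drop rows with an extreme 'a' value."""
    mask = X["a"].abs() < 3.0
    return X[mask], y[mask]


# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=200), "b": rng.uniform(size=200)})
y = (X["a"] + X["b"] > 0.5).astype(int)
X_train, X_test = X.iloc[:150], X.iloc[150:]
y_train, y_test = y.iloc[:150], y.iloc[150:]

# Steps inside the pipeline run at both fit and predict time.
pipe = Pipeline([
    ("apply", FunctionTransformer(add_ratio)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# The training-only step runs before fit and is never applied to test data.
X_fit, y_fit = drop_outliers(X_train, y_train)
pipe.fit(X_fit, y_fit)
preds = pipe.predict(X_test)
```

Keeping the training-only step outside the pipeline is a design choice: a plain sklearn.Pipeline applies every transformer at both fit and predict time, so steps that change the number of rows do not fit naturally inside it.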