python pandas numpy scikit-learn sklearn-pandas

scikit-learn transformer that bins data based on user-supplied cut points

I am trying to include a transformer in a scikit-learn pipeline that bins a continuous data column into four values using cut points that I supply myself. KBinsDiscretizer does not cover this, mainly because its strategy argument only accepts {‘uniform’, ‘quantile’, ‘kmeans’}.

pandas already provides the cut() function, so I assume I will need to write a custom transformer that wraps its behavior.

Desired Behavior (not actual)

X = [[-2, -1, -0.5, 0, 0.5, 1, 2]]
est = Discretizer(bins=[-float("inf"), -1.0, 0.0, 1.0, float("inf")], 
                  encode='ordinal')
est.fit(X)  
est.transform(X)
# >>> array([[0., 0., 1., 1., 2., 2., 3.]])

The result above assumes that each bin includes its right edge and that the first bin includes the lowest value, which is what this pd.cut() call returns:

import pandas as pd
import numpy as np
pd.cut(np.array([-2, -1, -0.5, 0, 0.5, 1, 2]),
       [-float("inf"), -1.0, 0.0, 1.0, float("inf")], 
       labels=False, right=True, include_lowest=True)
# >>> array([0, 0, 1, 1, 2, 2, 3])
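For comparison, the same right-inclusive binning can be sketched with NumPy alone. This assumes only the three interior cut points are passed (the infinite outer edges are implied), and the variable names are illustrative, not part of any library API.

```python
import numpy as np

values = np.array([-2, -1, -0.5, 0, 0.5, 1, 2])

# Only the interior cut points are needed; anything at or below -1.0 falls
# into bin 0 and anything above 1.0 falls into bin 3.
inner_edges = np.array([-1.0, 0.0, 1.0])

# right=True assigns index i when inner_edges[i-1] < x <= inner_edges[i],
# which matches pd.cut(..., right=True).
codes = np.digitize(values, inner_edges, right=True)
print(codes.tolist())  # [0, 0, 1, 1, 2, 2, 3]
```

Because np.digitize() works on plain arrays, a sketch like this may be handy when the data is not held in a DataFrame.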

Solution

The custom transformer below seems to work for me. scikit-learn expects numeric arrays, so I am not sure whether the pd.cut() feature that returns labels can be supported. For this reason labels is hard-coded to False in the implementation below.

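import numpy as np  # used by the example further down
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CutTransformer(BaseEstimator, TransformerMixin):
    """Bin every column of a DataFrame with pd.cut() and return integer bin codes."""

    # retbins is deliberately not exposed: with retbins=True, pd.cut() returns
    # a (codes, bins) tuple, which cannot be written back into a column.
    def __init__(self, bins, right=True, precision=3,
                 include_lowest=False, duplicates='raise'):
        self.bins = bins
        self.right = right
        self.precision = precision
        self.include_lowest = include_lowest
        self.duplicates = duplicates

    def fit(self, X, y=None):
        # Nothing to learn: the cut points are supplied by the user.
        return self

    def transform(self, X, y=None):
        if not isinstance(X, pd.DataFrame):
            raise TypeError("CutTransformer expects a pandas DataFrame")
        # labels=False makes pd.cut() return integer bin codes. apply() builds
        # a new DataFrame, so the caller's data is not modified in place.
        return X.apply(
            lambda col: pd.cut(col, bins=self.bins, labels=False,
                               right=self.right, precision=self.precision,
                               include_lowest=self.include_lowest,
                               duplicates=self.duplicates))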
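On the question of labels: one possible way to keep readable labels and still hand numbers to scikit-learn is to let pd.cut() build a Categorical and read its integer codes. This is only a sketch; the label names below are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([-2, -1, -0.5, 0, 0.5, 1, 2])
edges = [-np.inf, -1.0, 0.0, 1.0, np.inf]

# Hypothetical label names, one per bin (len(edges) - 1 of them).
names = ["low", "mid_low", "mid_high", "high"]

# With an array input and explicit labels, pd.cut() returns a Categorical.
cats = pd.cut(values, edges, labels=names, right=True)

# .codes gives the numeric bin index that an estimator can consume,
# while the Categorical itself keeps the readable labels.
codes = cats.codes
print(list(cats))      # ['low', 'low', 'mid_low', 'mid_low', 'mid_high', 'mid_high', 'high']
print(codes.tolist())  # [0, 0, 1, 1, 2, 2, 3]
```

The labels would then live outside the pipeline (for reporting, say), with only the codes passed downstream.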

An Example

df = pd.DataFrame(data={'rand': np.random.rand(5)})
df
    rand
0   0.030653
1   0.542533
2   0.159646
3   0.963112
4   0.539530

ct = CutTransformer(bins=np.linspace(0, 1, 5))
ct.transform(df)
    rand
0   0
1   2
2   0
3   3
4   2
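
An alternative that may avoid writing a class at all is to wrap pd.cut() in scikit-learn's FunctionTransformer and place that in a Pipeline. The sketch below assumes DataFrame input; cut_columns is a hypothetical helper written for this example, not a library function.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def cut_columns(X, bins, right=True, include_lowest=False):
    """Hypothetical helper: apply pd.cut() to each column, returning bin codes."""
    X = pd.DataFrame(X)
    return X.apply(lambda col: pd.cut(col, bins=bins, labels=False,
                                      right=right,
                                      include_lowest=include_lowest))

edges = [-np.inf, -1.0, 0.0, 1.0, np.inf]

# kw_args forwards the user-supplied cut points to cut_columns.
binner = FunctionTransformer(cut_columns, kw_args={"bins": edges})
pipe = Pipeline([("bin", binner)])

df = pd.DataFrame({"x": [-2, -1, -0.5, 0, 0.5, 1, 2]})
binned = pipe.fit_transform(df)
print(binned["x"].tolist())  # [0, 0, 1, 1, 2, 2, 3]
```

The trade-off is that the cut points sit in kw_args rather than in named constructor parameters, so the custom class may still be the better fit when the bins need to be tuned through get_params()/set_params().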