Tags: python, pandas, numpy, scikit-learn, sklearn-pandas

scikit-learn transformer that bins data based on user supplied cut points


I am trying to include a transformer in a scikit-learn pipeline that bins a continuous data column into 4 values based on cut points that I supply myself. KBinsDiscretizer does not fit this use case, mainly because its strategy argument only accepts 'uniform', 'quantile', or 'kmeans', none of which takes explicit bin edges.

pandas already provides the cut() function, so I assume I will need to write a custom transformer that wraps its behavior.

Desired Behavior (not actual)

X = [[-2, -1, -0.5, 0, 0.5, 1, 2]]
est = Discretizer(bins=[-float("inf"), -1.0, 0.0, 1.0, float("inf")], 
                  encode='ordinal')
est.fit(X)  
est.transform(X)
# >>> array([[0., 0., 1., 1., 2., 2., 3.]])

The result above assumes that each bin includes its right edge and that the first bin includes the lowest value, as this pd.cut() call does:

import pandas as pd
import numpy as np
pd.cut(np.array([-2, -1, -0.5, 0, 0.5, 1, 2]),
       [-float("inf"), -1.0, 0.0, 1.0, float("inf")], 
       labels=False, right=True, include_lowest=True)
# >>> array([0, 0, 1, 1, 2, 2, 3])
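
To make the edge convention concrete, here is a small sketch that runs the same data through pd.cut() with right=True and with right=False. The variable names (values, edges, right_closed, left_closed) are only illustrative.

```python
import numpy as np
import pandas as pd

values = np.array([-2, -1, -0.5, 0, 0.5, 1, 2])
edges = [-np.inf, -1.0, 0.0, 1.0, np.inf]

# right=True: bins are (a, b], so a value equal to an edge falls in the lower bin.
right_closed = pd.cut(values, edges, labels=False, right=True)

# right=False: bins are [a, b), so a value equal to an edge falls in the upper bin.
left_closed = pd.cut(values, edges, labels=False, right=False)

print(right_closed)  # [0 0 1 1 2 2 3]
print(left_closed)   # [0 1 1 2 2 3 3]
```

Only the values that sit exactly on an edge (-1, 0 and 1) change bins between the two calls. Because the lowest edge here is -inf, include_lowest makes no practical difference for finite inputs.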

Solution

  • This is what seems to work for me as a custom transformer. scikit-learn expects numeric arrays, so I am not sure whether the pd.cut() feature that returns labels can be supported. For this reason, I have hard-coded labels to False in the implementation below.

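    import numpy as np
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin
    
    class CutTransformer(BaseEstimator, TransformerMixin):
        # retbins is not exposed: with retbins=True, pd.cut returns a
        # (codes, bins) tuple that cannot be assigned to a column.
        def __init__(self, bins, right=True, precision=3,
                     include_lowest=False, duplicates='raise'):
            self.bins = bins
            self.right = right
            self.precision = precision
            self.include_lowest = include_lowest
            self.duplicates = duplicates
    
        def fit(self, X, y=None):
            # Stateless: the cut points come from the user, not the data.
            return self
    
        def transform(self, X, y=None):
            assert isinstance(X, pd.DataFrame)
            X = X.copy()  # leave the caller's DataFrame untouched
            for col in X.columns:
                # labels=False makes pd.cut return integer bin codes
                # (NaN for values that fall outside every bin).
                X[col] = pd.cut(x=X[col].values, bins=self.bins,
                                right=self.right, labels=False,
                                precision=self.precision,
                                include_lowest=self.include_lowest,
                                duplicates=self.duplicates)
            return X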
    

    An Example

    df = pd.DataFrame(data={'rand': np.random.rand(5)})
    df
        rand
    0   0.030653
    1   0.542533
    2   0.159646
    3   0.963112
    4   0.539530
    
    ct = CutTransformer(bins=np.linspace(0, 1, 5))
    ct.transform(df)
        rand
    0   0
    1   2
    2   0
    3   3
    4   2
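
To connect the answer back to the question, the following is a minimal sketch of the transformer used inside a scikit-learn Pipeline with the question's cut points. It restates a compact version of CutTransformer so the block runs on its own. The step name "cut", the column name "x", and the pd.DataFrame(X) conversion (so that plain arrays are accepted as well) are assumptions added for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class CutTransformer(BaseEstimator, TransformerMixin):
    """Compact restatement of the transformer above (illustrative)."""

    def __init__(self, bins, right=True, include_lowest=False):
        self.bins = bins
        self.right = right
        self.include_lowest = include_lowest

    def fit(self, X, y=None):
        # Nothing is learned from the data; the bins are user supplied.
        return self

    def transform(self, X, y=None):
        # Assumption: convert to a DataFrame copy so arrays also work
        # and the caller's data is not modified.
        X = pd.DataFrame(X).copy()
        for col in X.columns:
            X[col] = pd.cut(X[col].to_numpy(), bins=self.bins,
                            right=self.right,
                            include_lowest=self.include_lowest,
                            labels=False)
        return X


# Cut points and data taken from the question, arranged as one column.
bins = [-np.inf, -1.0, 0.0, 1.0, np.inf]
df = pd.DataFrame({"x": [-2, -1, -0.5, 0, 0.5, 1, 2]})

pipe = Pipeline([("cut", CutTransformer(bins=bins, include_lowest=True))])
binned = pipe.fit_transform(df)

print(binned["x"].tolist())  # [0, 0, 1, 1, 2, 2, 3]
```

The data is laid out as a single column of seven samples rather than one row of seven features, since the transformer bins each column independently. Further steps, such as an estimator, could be appended to the Pipeline after the "cut" step.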