I am trying to include a transformer in a scikit-learn pipeline that bins a continuous data column into 4 values based on cut points that I supply myself. The current arguments to KBinsDiscretizer do not allow this, mainly because the strategy argument only accepts {'uniform', 'quantile', 'kmeans'}.
pandas already has the cut() function, so I guess that I will need to create a custom transformer that wraps the cut() function's behavior.
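To make the limitation concrete, here is a minimal sketch showing that KBinsDiscretizer derives its bin edges from the data during fit rather than taking them from the caller. The sample data is made up for illustration, and strategy="uniform" is just one of the three accepted strategies.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical single-feature data, one row per sample.
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])

est = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
est.fit(X)

# The edges are evenly spaced between the observed min and max;
# there is no argument for passing custom cut points.
learned_edges = est.bin_edges_[0]
print(learned_edges)
```

The learned edges depend only on the range of the data, which is why supplied cut points need a different transformer.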
Desired Behavior (not actual)
X = [[-2, -1, -0.5, 0, 0.5, 1, 2]]
est = Discretizer(bins=[-float("inf"), -1.0, 0.0, 1.0, float("inf")],
                  encode='ordinal')
est.fit(X)
est.transform(X)
# >>> array([[0., 0., 1., 1., 2., 2., 3.]])
The result above assumes that each bin includes its rightmost edge and that the lowest edge is included, which is what this pd.cut() call provides:
import pandas as pd
import numpy as np
pd.cut(np.array([-2, -1, -0.5, 0, 0.5, 1, 2]),
       [-float("inf"), -1.0, 0.0, 1.0, float("inf")],
       labels=False, right=True, include_lowest=True)
# >>> array([0, 0, 1, 1, 2, 2, 3])
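For contrast, the sketch below runs the same values and bins with right=True and right=False, to show how the edge convention moves the values that sit exactly on a cut point (-1, 0 and 1). The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

values = np.array([-2, -1, -0.5, 0, 0.5, 1, 2])
bins = [-float("inf"), -1.0, 0.0, 1.0, float("inf")]

# right=True: intervals are closed on the right, e.g. (-1, 0].
right_closed = pd.cut(values, bins, labels=False, right=True)

# right=False: intervals are closed on the left, e.g. [-1, 0).
left_closed = pd.cut(values, bins, labels=False, right=False)

print(right_closed)  # [0 0 1 1 2 2 3]
print(left_closed)   # [0 1 1 2 2 3 3]
```

Only the values equal to an edge change bins; everything strictly inside an interval is unaffected.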
This is what seems to work for me as a custom transformer. scikit-learn expects arrays of numerics, so I'm not sure whether the pd.cut() feature that returns labels can be supported. For this reason I've hard-coded labels to False in the implementation below.
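import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CutTransformer(TransformerMixin, BaseEstimator):
    """Wrap pd.cut() so it can be used as a scikit-learn transformer."""

    def __init__(self, bins, right=True, retbins=False,
                 precision=3, include_lowest=False,
                 duplicates='raise'):
        self.bins = bins
        self.right = right
        self.labels = False  # hard-coded: always return integer bin codes
        self.retbins = retbins  # leave False; True makes pd.cut return a tuple
        self.precision = precision
        self.include_lowest = include_lowest
        self.duplicates = duplicates

    def fit(self, X, y=None):
        # The bins are user-supplied, so there is nothing to learn.
        return self

    def transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame)
        X = X.copy()  # avoid overwriting the caller's DataFrame
        for jj in range(X.shape[1]):
            # get_params() returns the __init__ arguments, all valid for pd.cut
            X.iloc[:, jj] = pd.cut(x=X.iloc[:, jj].values,
                                   labels=self.labels, **self.get_params())
        return X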
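On the labels point: a small sketch of what pd.cut() returns with and without labels, which is the reason for sticking to labels=False. The label names here are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([-2, -0.5, 0.5, 2])
bins = [-float("inf"), -1.0, 0.0, 1.0, float("inf")]

# With labels, pd.cut returns a Categorical of strings (not numeric).
labeled = pd.cut(values, bins, labels=["low", "mid_low", "mid_high", "high"])

# With labels=False, it returns plain integer bin codes.
coded = pd.cut(values, bins, labels=False)

print(list(labeled))  # ['low', 'mid_low', 'mid_high', 'high']
print(coded)          # [0 1 2 3]
```

The integer codes can be passed straight to later numeric steps, while the string labels would presumably need an extra encoding step first.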
An Example
df = pd.DataFrame(data={'rand': np.random.rand(5)})
df
       rand
0  0.030653
1  0.542533
2  0.159646
3  0.963112
4  0.539530
ct = CutTransformer(bins=np.linspace(0, 1, 5))
ct.transform(df)
   rand
0     0
1     2
2     0
3     3
4     2
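Since the original goal was to use this inside a pipeline, here is a rough sketch of how that might look. CutSketch is a hypothetical, trimmed-down variant of the transformer above (fewer options, and it builds a new DataFrame instead of assigning in place), and the OneHotEncoder step and the fixed sample values are only assumptions to give the pipeline a second stage.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class CutSketch(TransformerMixin, BaseEstimator):
    """Hypothetical, trimmed-down pd.cut() wrapper for illustration."""

    def __init__(self, bins, right=True, include_lowest=False):
        self.bins = bins
        self.right = right
        self.include_lowest = include_lowest

    def fit(self, X, y=None):
        # The cut points are supplied by the user, so nothing is learned.
        return self

    def transform(self, X):
        # Build a new frame so the caller's data is left untouched.
        return pd.DataFrame(
            {
                col: pd.cut(X[col].to_numpy(), bins=self.bins, labels=False,
                            right=self.right,
                            include_lowest=self.include_lowest)
                for col in X.columns
            },
            index=X.index,
        )


# Fixed sample values (assumed) so the result is reproducible.
df = pd.DataFrame({"rand": [0.03, 0.54, 0.16, 0.96, 0.53]})

pipe = Pipeline([
    ("cut", CutSketch(bins=[0.0, 0.25, 0.5, 0.75, 1.0])),
    ("onehot", OneHotEncoder()),
])

codes = pipe.named_steps["cut"].transform(df)  # integer bin codes
encoded = pipe.fit_transform(df).toarray()     # one column per observed bin
print(codes["rand"].tolist())  # [0, 2, 0, 3, 2]
print(encoded.shape)           # (5, 3)
```

Only three of the four bins occur in this sample, so the one-hot output has three columns.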