python python-3.x pandas scikit-learn sklearn-pandas

Sklearn KBinsDiscretizer: keep original column names

I'm working on a machine learning problem and I'm discretizing some continuous variables using Sklearn KBinsDiscretizer.

from sklearn.preprocessing import KBinsDiscretizer

discretizer = KBinsDiscretizer(n_bins=8, encode='onehot')
discretizer.fit(dfDisc)

discretizer.transform(X_train)

Before the transform, X_train.columns returns:

["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]

After the transform (with the result put back into a pandas DataFrame), X_train.columns gives:

[0, 1, 2, 3, 4, 5, ......, 66, 67, 68]

Since I am analysing variables by their original names (A, B, C, ..., J) and have to report which variables were used for my classification, I'm looking for a way to know which original variable each output column number corresponds to. For example, I'd like my output X_train.columns to look like

["A_0", "A_1", "A_2", "A_3", "A_4", "B_0", "B_1", "B_2", "B_3", ... ]

I know such a method exists for sklearn's OneHotEncoder (get_feature_names), but I can't find an equivalent for KBinsDiscretizer.

One idea I had was to create a separate discretizer for each variable, apply each one to its own column, and rename the columns manually before merging everything, but that would be messy since I have to save my discretizers...

Also, even though I'm specifying n_bins=8, I get 69 output columns from my 10 input columns, so one input column doesn't always produce 8 output columns, and I can't rely on a fixed count to restore the column names either.

Solution

KBinsDiscretizer doesn't always return exactly n_bins bins for each column: bins that end up too narrow are removed. For example, when I ran the following code:

import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

np.random.seed(0)
df = pd.DataFrame(np.random.randint(1, 200, size=(30, 10)), 
                  columns=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"])
df['B'] = np.random.randint(1, 4, size=30)  # Set only 3 unique values

discretizer = KBinsDiscretizer(n_bins=8, encode='onehot')
discretizer.fit(df)

I got this warning:

Bins whose width are too small (i.e., <= 1e-8) in feature 1 are removed. Consider decreasing the number of bins.

You can review the resulting number of bins per column using the n_bins_ attribute (which is populated during fit).

>>> discretizer.n_bins_
array([8, 3, 8, 8, 8, 8, 8, 8, 8, 8])
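The per-column counts also account for the total number of output columns and tell you where each variable's block of columns starts. Here is a minimal sketch of that lookup, assuming the one-hot output is ordered by input column and then by bin. The lists are hand-written stand-ins for discretizer.n_bins_ and df.columns, and the helper name locate_output_column is hypothetical:

```python
from bisect import bisect_right
from itertools import accumulate

# Stand-ins for list(discretizer.n_bins_) and list(df.columns).
n_bins = [8, 3, 8, 8, 8, 8, 8, 8, 8, 8]
columns = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]

# With encode='onehot', each variable contributes one output column per bin,
# so the output width is the sum of the bin counts.
total_columns = sum(n_bins)

# offsets[i] is the index of the first output column belonging to columns[i].
offsets = [0] + list(accumulate(n_bins))[:-1]


def locate_output_column(index):
    """Return (variable name, bin number) for a one-hot output column index."""
    if not 0 <= index < total_columns:
        raise IndexError(f"output column {index} is out of range")
    feature = bisect_right(offsets, index) - 1
    return columns[feature], index - offsets[feature]
```

For instance, locate_output_column(8) lands on the first bin of "B", because "A" occupies output columns 0 to 7.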

You could also use this attribute to name the columns as you requested:

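# pd.SparseDataFrame was removed in pandas 1.0; build a sparse-backed DataFrame
# from the SciPy sparse matrix with DataFrame.sparse.from_spmatrix instead.
dft = pd.DataFrame.sparse.from_spmatrix(
    discretizer.transform(df),
    columns=[f'{col}_{b}' for col, bins in zip(df.columns, discretizer.n_bins_) for b in range(bins)]
)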
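If you need the labels in more than one place (for example, to report which original variables a model relied on), the naming logic can be pulled into a small helper. This is only a sketch: binned_column_names and source_variable are hypothetical names, and the short lists stand in for df.columns and discretizer.n_bins_:

```python
def binned_column_names(columns, n_bins):
    """Build '<column>_<bin>' labels, one per one-hot output column."""
    if len(columns) != len(n_bins):
        raise ValueError("need exactly one bin count per input column")
    return [f"{col}_{b}" for col, bins in zip(columns, n_bins) for b in range(int(bins))]


def source_variable(name):
    """Recover the original column name from a '<column>_<bin>' label."""
    # Split on the last underscore so column names containing '_' still work.
    return name.rsplit("_", 1)[0]


# Stand-in values; with a fitted discretizer these would be
# df.columns and discretizer.n_bins_.
names = binned_column_names(["A", "B", "C"], [8, 3, 8])
used_variables = sorted({source_variable(n) for n in names})
```

With the fitted discretizer from the example, the call would be binned_column_names(df.columns, discretizer.n_bins_), and the length of the result should match the number of columns returned by transform.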