python machine-learning one-hot-encoding

Using One-Hot Encoding vector as a feature for machine learning models


I have a categorical column ('session', which can take one of these values: [2, 4, 8]) that I want to use while training a machine learning model (like RandomForest or MLP).

To do that, I encoded this feature using one-hot encoding:

df = pd.get_dummies(df, columns=["session"], prefix="Sessions")

and I got three new columns, Sessions_2, Sessions_4 and Sessions_8, instead of the old session column.

Then I converted these three new columns into one vector (as a list) and populated a 'session' column with that list:

df['session'] = np.array(df[['Sessions_2', 'Sessions_4', 'Sessions_8']], dtype=object).tolist()

So, now the data looks like:

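[Image not preserved: the data frame with the Sessions_2, Sessions_4 and Sessions_8 columns and the list-valued 'session' column]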

When training the ML model, I thought it would be better to use the new vector 'session' column rather than the separate Sessions_x columns (otherwise, why did we do the one-hot encoding?).

But I'm getting this error:

ValueError: setting an array element with a sequence.

I searched for that error, and everywhere it was mentioned that the root cause might be vectors of different shapes or elements of different data types, but that is not what is happening here. I verified that all vectors have the same size and the same element types (I also used dtype=object when creating the np array).

I believe the issue is that the model tries to load an n-element array (a sequence) into a slot that can only hold a single float. I tried two different ML models, RandomForest and MLP, and got the same error with both.
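The error can be reproduced without any model. The following is a minimal sketch, assuming a toy data frame with four rows (the values are made up for illustration): converting the list-valued column to a float array fails in the same way, because each cell holds a sequence where a single number is expected.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): one categorical column.
df = pd.DataFrame({"session": [2, 4, 8, 2]})
df = pd.get_dummies(df, columns=["session"], prefix="Sessions")

cols = ["Sessions_2", "Sessions_4", "Sessions_8"]
# Same step as in the question: pack the three columns into one list per row.
df["session"] = np.array(df[cols], dtype=object).tolist()

# The column is a 1-D object array whose cells are Python lists.
col = df["session"].to_numpy()
print(col.dtype, col.shape)

# Estimators need a numeric 2-D matrix; a float conversion of this
# column fails because each element is a sequence, not a scalar.
try:
    np.asarray(col, dtype=float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Using dtype=object when building the lists does not help: it only lets the column store lists, it does not turn them into numeric features.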

How can I make my ML model work with the one-hot encoded vector? (Is using a vector the right approach in the first place?)


Solution

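  • Your data frame already contains the one-hot encoding of the categorical feature: it is exactly the three columns Sessions_2, Sessions_4 and Sessions_8. There is no need to add the list-valued session column (object dtype). It is redundant, and it is invalid as a model input because each cell holds a sequence rather than a single number. Pass the three columns to the model as separate features.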
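
As an illustration of the point above, here is a minimal sketch that trains on the separate dummy columns. The "label" target, the toy values, and the hyperparameters are assumptions made up for the example; RandomForestClassifier is scikit-learn's random forest, which may or may not be the exact estimator used in the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data: "label" is a hypothetical target, not part of the question.
df = pd.DataFrame({
    "session": [2, 4, 8, 2, 4, 8],
    "label": [0, 1, 1, 0, 1, 1],
})
df = pd.get_dummies(df, columns=["session"], prefix="Sessions")

# Use the one-hot columns directly as three separate numeric features.
feature_cols = ["Sessions_2", "Sessions_4", "Sessions_8"]
X = df[feature_cols].to_numpy(dtype=float)  # shape (n_samples, 3)
y = df["label"].to_numpy()

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X, y)
predictions = model.predict(X)

# If a list-valued column already exists, it can be expanded back into
# the same 2-D matrix instead of being passed to the model as-is.
df["session"] = np.array(df[feature_cols], dtype=object).tolist()
X_from_lists = np.array(df["session"].tolist(), dtype=float)
same_matrix = np.array_equal(X, X_from_lists)
```

Either way, the model ends up seeing one numeric column per category, which is all a one-hot encoding is.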