Tags: python, machine-learning, keras, regression, keras-layer

How does one input a mixture of 2D and 1D columns (one-hot encoded arrays and regular integers, respectively) into a Keras Sequential model?


I have a pandas DataFrame with 849743 rows and 13 columns, i.e. a shape of (849743,13).

Most of these columns simply contain integers; however, 3 of them hold one-hot encoded categorical variables. These were not encoded using Keras', sklearn's, or any other library's one-hot encoding/embedding functionality; I simply did it manually in Python.
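For reference, the encoding was done roughly along these lines (a minimal sketch of what I mean by "manually"; the category values are placeholders, not my actual data):

import numpy as np

# Hypothetical categories for column 'd' (7 of them, matching the excerpt below)
d_categories = ['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6']

def one_hot(value, categories):
    """Build a 1D numpy array with a 1 at the value's position."""
    vec = np.zeros(len(categories), dtype=int)
    vec[categories.index(value)] = 1
    return vec

# Every cell of df['d'] ends up holding an entire numpy array
df['d'] = df['d'].apply(lambda v: one_hot(v, d_categories))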

For instance, df['d'] is a column with one-hot encoded variables, here is an excerpt:

1082077    [0, 1, 0, 0, 0, 0, 0]
995216     [1, 0, 0, 0, 0, 0, 0]
924611     [0, 0, 0, 0, 1, 0, 0]
1171772    [0, 0, 0, 1, 0, 0, 0]
96796      [0, 0, 1, 0, 0, 0, 0]

Please ignore the nonsensical Pandas indexing.

This is the first row in the column:

array([1, 0, 0, 0, 0, 0, 0])

As can be seen, the elements of this DataFrame's column are all nested numpy arrays.

For further insight into how the Pandas DataFrame is structured, here are all of the elements of the first row:

a                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
b                                  [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
c                                                                     1
d                                                 [1, 0, 0, 0, 0, 0, 0]
e                                                               1.53079
f                                                             -0.415253
g                                                             -0.425906
h                                                             -0.355143
i                                                             -0.249699
j                                                              -0.13448
k                                                              0.882726
l                                                               1.23091

Subsequently, I convert this to a numpy array using:

x_train = df.values

This retains the original dimensions of the DataFrame, which are (849743,13).
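A quick check shows where the nesting ends up (assuming the DataFrame above; the object dtype is the giveaway that some cells contain whole arrays rather than scalars):

print(x_train.shape)    # (849743, 13)
print(x_train.dtype)    # object, because the one-hot columns hold an array per cell
print(x_train[0, 3])    # the nested one-hot array from column 'd', e.g. [1 0 0 0 0 0 0]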

I've created a nonsensical Keras Sequential model just to test if the inputs will work, which is how I found the error in the first place. The model is as follows:

# create model
model = Sequential()
model.add(Dense(130, input_dim=13, kernel_initializer='normal', 
          activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')

The input_dim has been set to 13 as there are 13 columns in the DataFrame/numpy array; however, I believe the problem arises from the nested numpy arrays in the 3 one-hot encoded columns.

I pass my original 13-column numpy array, x_train, alongside y_train (the observed target values) into the model.fit function:

model.fit(x_train, y_train,
          epochs=20,
          batch_size=128)

I get the following error:

    Bad input argument to theano function with name "train_function" at index 0 (0-based).  
Backtrace when that variable is created:

  File "C:\Users\Studying\AppData\Local\conda\conda\envs\Tensorflow-gpu\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/Studying/Documents/GitHub/IFN665/Machine Learning/keras_regression_practice.py", line 106, in <module>
    model = baseline_model(input_shape)
  File "C:/Users/Studying/Documents/GitHub/IFN665/Machine Learning/keras_regression_practice.py", line 23, in baseline_model
    model.add(Dense(130, input_dim=1, kernel_initializer='normal', activation='relu'))
  File "C:\Users\Studying\AppData\Local\conda\conda\envs\Tensorflow-gpu\lib\site-packages\keras\models.py", line 432, in add
    dtype=layer.dtype, name=layer.name + '_input')
  File "C:\Users\Studying\AppData\Local\conda\conda\envs\Tensorflow-gpu\lib\site-packages\keras\engine\topology.py", line 1426, in Input
    input_tensor=tensor)
  File "C:\Users\Studying\AppData\Local\conda\conda\envs\Tensorflow-gpu\lib\site-packages\keras\legacy\interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Studying\AppData\Local\conda\conda\envs\Tensorflow-gpu\lib\site-packages\keras\engine\topology.py", line 1337, in __init__
    name=self.name)
  File "C:\Users\Studying\AppData\Local\conda\conda\envs\Tensorflow-gpu\lib\site-packages\keras\backend\theano_backend.py", line 222, in placeholder
    x = T.TensorType(dtype, broadcast)(name)
setting an array element with a sequence.

I have tried removing all of the one-hot encoded columns and adjusting the input_dim variable accordingly, and it does work (work in the sense that it doesn't raise an error; the model is obviously a garbage predictor).

I do not believe it is possible (though I have not searched extensively) to have a numpy array in which some elements are 2D and others are 1D, e.g. keeping the nested one-hot encoded arrays as 2D while all the other variables remain 1D.

I have searched for similar questions on this site; however, everything I find about Keras and one-hot encoded variables appears to ask either what one-hot encoding is or how to do it, not how to feed in a mixture of one-hot encoded and 1D integer inputs.

How can this be done? Am I missing something glaringly obvious?


Solution

  • The problem is that your data is not uniform: when you convert it to a NumPy array, some entries are themselves arrays (the one-hot encoded ones), which causes a shape/type mismatch. You have 2 options, depending on how you want to process the data:

    1. Flatten the inner arrays so your final shape is (samples, >13). By flatten I mean give the one-hot encoded data its own columns in the NumPy array, so a row looks like [0, 0, 1, 0, 0, ..., 2.3492, 1.3483, ...] and the shape is consistent. Then input_dim=len(data[0]). See the sketch below.
    2. If you really want separate inputs, for example to process them differently by passing them through different Dense layers, you will need to switch to the functional API. It would be a multi-input model, and the documentation explains it well; a second sketch follows.
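A sketch of option 1, assuming the one-hot columns are 'a', 'b' and 'd' as in the printed first row (adjust the names to your frame). Each nested array is expanded into its own block of columns and everything is stacked side by side into a plain 2D float array; the column order changes, but a Dense layer does not care about that:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

onehot_cols = ['a', 'b', 'd']                    # columns whose cells hold arrays
scalar_cols = [c for c in df.columns if c not in onehot_cols]

# np.stack turns a column of 1D arrays into a (samples, n_categories) block
blocks = [np.stack(df[col].values) for col in onehot_cols]
blocks.append(df[scalar_cols].values.astype('float32'))

x_train = np.hstack(blocks).astype('float32')    # plain 2D array, no nesting
print(x_train.shape)                             # (849743, total expanded columns)

model = Sequential()
model.add(Dense(130, input_dim=x_train.shape[1],
                kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x_train, y_train, epochs=20, batch_size=128)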
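And a sketch of option 2 with the functional API. The input sizes and the prepared arrays (a_array, b_array, d_array, scalar_array, e.g. built with np.stack as above) are hypothetical placeholders; each one-hot group becomes its own input, which you could route through its own layers before merging:

from keras.models import Model
from keras.layers import Input, Dense, concatenate

# One input per one-hot group plus one for the remaining numeric columns
in_a = Input(shape=(26,))        # adjust to the number of categories in 'a'
in_b = Input(shape=(12,))        # column 'b' has 12 categories in the excerpt
in_d = Input(shape=(7,))         # column 'd' has 7 categories
in_scalars = Input(shape=(10,))  # adjust to the number of plain numeric columns

merged = concatenate([in_a, in_b, in_d, in_scalars])
hidden = Dense(130, kernel_initializer='normal', activation='relu')(merged)
output = Dense(1, kernel_initializer='normal')(hidden)

model = Model(inputs=[in_a, in_b, in_d, in_scalars], outputs=output)
model.compile(loss='mean_squared_error', optimizer='adam')

# fit takes a list of arrays, one per input, in the same order as `inputs`
model.fit([a_array, b_array, d_array, scalar_array], y_train,
          epochs=20, batch_size=128)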