I have a pandas DataFrame with 849743 rows and 13 columns, i.e. a shape of (849743,13).
The majority of these columns simply contain integers; however, 3 of them hold one-hot encoded categorical variables. They were not encoded with Keras', sklearn's, or any other library's one-hot encoding/embedding functionality; I simply did it manually in Python.
For instance, df['d'] is a column with one-hot encoded variables, here is an excerpt:
1082077 [0, 1, 0, 0, 0, 0, 0]
995216 [1, 0, 0, 0, 0, 0, 0]
924611 [0, 0, 0, 0, 1, 0, 0]
1171772 [0, 0, 0, 1, 0, 0, 0]
96796 [0, 0, 1, 0, 0, 0, 0]
Please ignore the nonsensical Pandas indexing.
This is the first row in the column:
array([1, 0, 0, 0, 0, 0, 0])
As can be seen, the elements of this DataFrame's column are all nested numpy arrays.
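For illustration, here is a minimal sketch of the kind of manual encoding described above (the category labels and the pre-encoding column name 'd_raw' are hypothetical, not my actual code):

import numpy as np

categories = ['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6']  # 7 hypothetical labels for column 'd'

def one_hot(label):
    # fixed-length 0/1 vector for one category label
    vec = np.zeros(len(categories), dtype=int)
    vec[categories.index(label)] = 1
    return vec

df['d'] = df['d_raw'].apply(one_hot)  # every cell of 'd' now holds a whole numpy array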
For further insight into how the Pandas DataFrame is structured, here are all of the elements of the first row:
a [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
b [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
c 1
d [1, 0, 0, 0, 0, 0, 0]
e 1.53079
f -0.415253
g -0.425906
h -0.355143
i -0.249699
j -0.13448
k 0.882726
l 1.23091
Subsequently, I convert this to a numpy array using:
x_train = df.values
This retains the original dimensions of the DataFrame, which are (849743,13).
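The nested arrays show up in the resulting array's dtype. A minimal sketch with a toy frame (hypothetical values, not the real 849743-row data):

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'c': [1, 2],
    'd': [np.array([1, 0, 0]), np.array([0, 1, 0])],  # one-hot cells are whole arrays
    'e': [1.53, -0.42],
})
x = toy.values
print(x.shape)  # (2, 3) -- it counts columns, not individual features
print(x.dtype)  # object -- a float tensor cannot be built from this directly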
I've created a nonsensical Keras Sequential model just to test whether these inputs would work at all, which is how I found the error in the first place. The model is as follows:
from keras.models import Sequential
from keras.layers import Dense

# create model
model = Sequential()
model.add(Dense(130, input_dim=13, kernel_initializer='normal',
                activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')
input_dim has been set to 13 because there are 13 columns in the DataFrame/numpy array; however, I believe the problem arises from the nested numpy arrays in the 3 one-hot encoded columns.
I pass my original 13-column numpy array, x_train, along with y_train (the observed target values) into model.fit:
model.fit(x_train, y_train,
          epochs=20,
          batch_size=128)
I get the following error:
Bad input argument to theano function with name "train_function" at index 0 (0-based).
Backtrace when that variable is created:
File "C:\Users\Studying\AppData\Local\conda\conda\envs\Tensorflow-gpu\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Studying/Documents/GitHub/IFN665/Machine Learning/keras_regression_practice.py", line 106, in <module>
model = baseline_model(input_shape)
File "C:/Users/Studying/Documents/GitHub/IFN665/Machine Learning/keras_regression_practice.py", line 23, in baseline_model
model.add(Dense(130, input_dim=1, kernel_initializer='normal', activation='relu'))
File "C:\Users\Studying\AppData\Local\conda\conda\envs\Tensorflow-gpu\lib\site-packages\keras\models.py", line 432, in add
dtype=layer.dtype, name=layer.name + '_input')
File "C:\Users\Studying\AppData\Local\conda\conda\envs\Tensorflow-gpu\lib\site-packages\keras\engine\topology.py", line 1426, in Input
input_tensor=tensor)
File "C:\Users\Studying\AppData\Local\conda\conda\envs\Tensorflow-gpu\lib\site-packages\keras\legacy\interfaces.py", line 87, in wrapper
return func(*args, **kwargs)
File "C:\Users\Studying\AppData\Local\conda\conda\envs\Tensorflow-gpu\lib\site-packages\keras\engine\topology.py", line 1337, in __init__
name=self.name)
File "C:\Users\Studying\AppData\Local\conda\conda\envs\Tensorflow-gpu\lib\site-packages\keras\backend\theano_backend.py", line 222, in placeholder
x = T.TensorType(dtype, broadcast)(name)
setting an array element with a sequence.
I have tried removing all the one-hot encoded columns and adjusting input_dim accordingly (sketched below), and it does work ("work" in the sense that it no longer raises an error; the model is obviously a garbage predictor).
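A sketch of that working-but-useless variant, assuming the three array-valued columns are 'a', 'b' and 'd' as in the first row shown above:

x_train = df.drop(['a', 'b', 'd'], axis=1).values.astype('float32')
# only the 10 scalar columns remain, so input_dim=10 and model.fit runs without the error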
I do not believe it is possible (though I haven't searched exhaustively) to have a numpy array in which some elements are 2-D and others are 1-D, e.g. turning the nested one-hot arrays into 2-D lists while leaving all the other variables as single scalars.
I have searched for similar questions on this site; however, everything I find about Keras and one-hot encoding seems to ask either what it is or how to do it, not how to feed a mixture of one-hot encoded and plain integer inputs.
How can this be done? Am I missing something glaringly obvious?
The problem is that your data is not uniform: when you convert it to a NumPy array, some entries (the one-hot encoded ones) are themselves arrays, so NumPy falls back to an object array and Keras/Theano hits a shape/type mismatch. You have 2 options depending on how you want to process the data; the simpler one is to expand each row into a single flat feature vector:
[0, 0, 1, 0, 0, ..., 2.3492, 1.3483, ...]
so the shape is consistent across samples. Then your input_dim = len(data[0])
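A minimal sketch of that flattening, assuming each one-hot cell is a 1-D list/array and every other cell is a single scalar:

import numpy as np

def flatten_row(row):
    # concatenate every cell -- scalar or one-hot array -- into one 1-D float vector
    return np.concatenate([np.atleast_1d(np.asarray(v, dtype='float32')) for v in row])

x_train = np.stack([flatten_row(row) for row in df.values])
input_dim = x_train.shape[1]  # use this instead of the hard-coded 13

With this, x_train is a plain 2-D float32 array, and model.fit(x_train, y_train, ...) should no longer trigger the Theano error.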