I am trying to train a deep learning model for a regression problem. I have 2,000 significant categorical inputs, each of which has 3 categories. If I convert them to dummy variables, I will have 6,000 dummy variables as input to the deep learning model, which makes optimization very hard since my inputs (6,000 dummy variables) are not zero-centered. Also, the variance in each dummy variable is small, so the 6,000 dummy variables will have a hard time explaining the variance in the output. I was wondering if I should z-score the dummy variables to help optimization? Also, is there a better solution for dealing with these 2,000 categorical inputs?
You should use embeddings, which translate large sparse vectors into a lower-dimensional space that preserves semantic relationships. For each categorical feature you then get a dense vector representation instead of a block of dummy variables.
Here is example code using TensorFlow:
import numpy as np
import tensorflow as tf

# col1, col2 are integer-encoded category columns (values 0..K-1).
# Number of distinct categories in the first column (3 in your case).
unique_amount_1 = len(np.unique(col1))
input_1 = tf.keras.layers.Input(shape=(1,), name='input_1')
# Map each category index to a trainable 50-dimensional dense vector.
embedding_1 = tf.keras.layers.Embedding(unique_amount_1, 50, trainable=True)(input_1)
col1_embedding = tf.keras.layers.Flatten()(embedding_1)

unique_amount_2 = len(np.unique(col2))
input_2 = tf.keras.layers.Input(shape=(1,), name='input_2')
embedding_2 = tf.keras.layers.Embedding(unique_amount_2, 50, trainable=True)(input_2)
col2_embedding = tf.keras.layers.Flatten()(embedding_2)

# Concatenate the embeddings and regress on them (single output unit).
combined = tf.keras.layers.concatenate([col1_embedding, col2_embedding])
result = tf.keras.layers.Dense(1)(combined)

model = tf.keras.Model(inputs=[input_1, input_2], outputs=result)
where 50 is the size of the embedding vector.
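With 2,000 categorical columns you would not write these layers out by hand; the same pattern can be built in a loop. Below is a minimal sketch, assuming the categories are already integer-encoded as 0..2 in a NumPy array X_cat of shape (n_samples, 2000) and the regression target is y. The array names, the small embedding size, and the hidden layer are illustrative assumptions, not part of the answer above; for a 3-category feature a much smaller embedding than 50 is usually enough.

import numpy as np
import tensorflow as tf

# Assumption: X_cat is an (n_samples, 2000) integer array with values in {0, 1, 2},
# and y is an (n_samples,) array of regression targets.
n_cols = X_cat.shape[1]
embedding_dim = 3  # small embedding; each column has only 3 categories

inputs, embeddings = [], []
for i in range(n_cols):
    # One scalar input per categorical column.
    inp = tf.keras.layers.Input(shape=(1,), name=f'input_{i}')
    # 3 possible categories -> embedding_dim-dimensional trainable vector.
    emb = tf.keras.layers.Embedding(input_dim=3, output_dim=embedding_dim)(inp)
    embeddings.append(tf.keras.layers.Flatten()(emb))
    inputs.append(inp)

# Concatenate all column embeddings, then regress with a small hidden layer.
combined = tf.keras.layers.concatenate(embeddings)
hidden = tf.keras.layers.Dense(128, activation='relu')(combined)
output = tf.keras.layers.Dense(1)(hidden)

model = tf.keras.Model(inputs=inputs, outputs=output)
model.compile(optimizer='adam', loss='mse')

# Keras expects one array per named input: split the matrix into column vectors.
model.fit([X_cat[:, i] for i in range(n_cols)], y, epochs=10, batch_size=256)

Note that building and feeding 2,000 separate inputs can be slow; one common alternative is a single input of shape (2000,) with one shared Embedding layer, offsetting each column's indices so every column uses its own rows of the embedding table.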