machine-learning, keras, deep-learning, prediction

Can I use an embedding layer instead of one-hot encoding as category input?


I am trying to use an FFM (field-aware factorization machine) to predict binary labels. My dataset is as follows:

sex|age|price|label
0|0|0|0
1|0|1|1

I know that FFM is a model that considers some attributes as belonging to the same field. If I use one-hot encoding to transform the dataset, then the dataset will look as follows:

sex_0|sex_1|age_0|age_1|price_0|price_1|label
1|0|1|0|1|0|0
0|1|1|0|0|1|1

Thus, sex_0 and sex_1 can be considered as one field. The other attributes are similar.
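For concreteness, this transformation can be produced with pandas. A minimal sketch, assuming the two-row toy dataset above (declaring both levels of each column as categories forces get_dummies to emit all six indicator columns even in this tiny sample):

import pandas as pd

# Toy dataset from above.
df = pd.DataFrame({"sex": [0, 1], "age": [0, 0], "price": [0, 1], "label": [0, 1]})

# Declare both levels explicitly so every indicator column (sex_0 ... price_1)
# is created even if a level does not occur in this small sample.
for col in ["sex", "age", "price"]:
    df[col] = pd.Categorical(df[col], categories=[0, 1])

one_hot = pd.get_dummies(df, columns=["sex", "age", "price"])
print(one_hot)  # columns: label, sex_0, sex_1, age_0, age_1, price_0, price_1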

My question is whether I can use an embedding layer to replace the one-hot encoding step. However, I have some concerns:

  1. I don't have any other related dataset, so I cannot use any pre-trained embeddings. I can only randomly initialize the embedding weights and then train them on my own dataset. Will this approach work?
  2. If I use an embedding layer instead of one-hot encoding, does it mean that each attribute will belong to one field?
  3. What is the difference between these two methods? Which is better?

Solution

  • Yes, you can use embeddings, and that approach does work.

    An attribute will not correspond to a single element of the embedding; rather, the combination of embedding elements as a whole represents that attribute. The size of the embedding is something you will have to select yourself. A good formula to follow is embedding_size = min(50, (m + 1) // 2), where m is the number of categories; so if you have m = 10 you will get an embedding size of 5. (A minimal Keras sketch appears at the end of this answer.)

    A higher embedding size means it will capture more detail about the relationships between the categorical values.

    In my experience, embeddings help especially when a categorical feature has hundreds of unique values (if a feature has only a few values, e.g. a person's sex, then one-hot encoding is sufficient).

    As for which is better: I find that embeddings generally perform better when there are hundreds of unique values in a category. I don't have concrete reasons for why this is, only some intuitions.

    For example, representing categories as 300-dimensional dense vectors (word embeddings) requires classifiers to learn far fewer weights than if the categories were represented as 50,000-dimensional vectors (one-hot encoding), and the smaller parameter space possibly helps with generalization and avoiding overfitting.
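    As a concrete illustration of the points above, here is a minimal Keras sketch, assuming TensorFlow 2.x and a single integer-encoded categorical feature with m unique values; the dummy training data is purely illustrative and this is not the asker's full FFM model:

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers

    m = 10                                  # number of unique categories
    embedding_size = min(50, (m + 1) // 2)  # heuristic above -> 5 for m = 10

    # Integer-encoded category in, learned dense vector out, binary label predicted.
    inputs = layers.Input(shape=(1,), dtype="int32")
    x = layers.Embedding(input_dim=m, output_dim=embedding_size)(inputs)
    x = layers.Flatten()(x)                 # shape: (batch, embedding_size)
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # The embedding weights start out randomly initialized and are learned from
    # your own labels during training -- no pre-trained vectors are required.
    X = np.random.randint(0, m, size=(256, 1))   # dummy integer-encoded feature
    y = np.random.randint(0, 2, size=(256, 1))   # dummy binary labels
    model.fit(X, y, epochs=2, verbose=0)

    Compared with a wide one-hot input, the layers after the embedding only have to consume an embedding_size-dimensional vector, which is where the reduction in learned weights mentioned above comes from.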