Tags: tensorflow, keras, bert-language-model, embedding

Can I feed categorical data into a Keras Embedding layer without encoding the data?


I am trying to feed multi-column categorical data into a Keras Embedding layer. Can I feed categorical data into the layer without encoding it first?

If not, which encoding method is preferable for retrieving contextual information from the categorical data?


Solution

  • No, you cannot feed categorical data into a Keras Embedding layer without encoding it first. The layer expects non-negative integer indices as input, not raw labels.

    There are several ways to encode the data:

    1. Integer Encoding: Where each unique label is mapped to an integer.
    2. One Hot Encoding: Where each label is mapped to a binary vector.
    3. Learned Embedding: Where a distributed representation of the categories is learned.
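
To make the first two options concrete, here is a minimal sketch in plain Python. The dataset, column names, and helper functions (`build_vocab`, `integer_encode`, `one_hot`) are hypothetical and exist only for illustration; id 0 is reserved for unseen labels, which is an assumption of this sketch rather than a Keras requirement.

```python
# Hypothetical two-column categorical dataset (illustrative values only).
rows = [
    {"color": "red", "size": "S"},
    {"color": "blue", "size": "M"},
    {"color": "red", "size": "L"},
]


def build_vocab(values):
    """Map each unique label to an integer id, starting at 1 (0 = unseen)."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab) + 1
    return vocab


def integer_encode(values, vocab):
    """Replace each label with its id, falling back to 0 for unknown labels."""
    return [vocab.get(value, 0) for value in values]


def one_hot(index, depth):
    """Return a binary vector of length `depth` with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(depth)]


columns = ["color", "size"]

# One vocabulary per column, so ids from different columns never mix.
vocabs = {c: build_vocab([r[c] for r in rows]) for c in columns}
encoded = {c: integer_encode([r[c] for r in rows], vocabs[c]) for c in columns}

# One-hot version of the "color" column (depth includes the reserved id 0).
color_one_hot = [one_hot(i, len(vocabs["color"]) + 1) for i in encoded["color"]]
```

Each integer-encoded column could then be passed to its own `tf.keras.layers.Embedding(input_dim=len(vocab) + 1, output_dim=...)` layer, with the resulting vectors concatenated downstream. That wiring is one common pattern, not the only one.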

    The preferred method for retrieving contextual information from categorical data is the learned-embedding approach. If your category labels are natural-language words, you could use any of the pretrained embeddings below:

    1. GloVe embeddings (https://nlp.stanford.edu/projects/glove/)
    2. Word2Vec
    3. ConceptNet (https://github.com/commonsense/conceptnet-numberbatch)
    4. ELMo embeddings (https://github.com/yuanxiaosc/ELMo)
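
As a rough sketch of how static pretrained vectors such as GloVe might be lined up with integer-encoded categories, the snippet below parses GloVe-style text (one token per line followed by its vector components) and builds an embedding matrix indexed by category id. The vectors, the vocabulary, and the helper names `load_vectors` and `build_embedding_matrix` are all made up for illustration; a real GloVe file would be read from disk instead of an in-memory string.

```python
import io

# In-memory stand-in for a GloVe-style text file. The numbers are invented.
fake_glove = io.StringIO(
    "red 0.1 0.2 0.3\n"
    "blue 0.4 0.5 0.6\n"
)


def load_vectors(handle):
    """Parse 'token v1 v2 ...' lines into a dict of token -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, vectors, dim):
    """Row i holds the vector for the label with id i; missing labels stay zero."""
    matrix = [[0.0] * dim for _ in range(len(vocab) + 1)]
    for label, idx in vocab.items():
        if label in vectors:
            matrix[idx] = vectors[label]
    return matrix


# Hypothetical integer encoding of a categorical column (0 reserved for unseen).
vocab = {"red": 1, "blue": 2, "teal": 3}

vectors = load_vectors(fake_glove)
matrix = build_embedding_matrix(vocab, vectors, dim=3)
```

A matrix like this could be used to initialize a Keras Embedding layer, for example through `embeddings_initializer=tf.keras.initializers.Constant(matrix)`. Labels that are missing from the pretrained vocabulary (here `"teal"`) keep an all-zero row in this sketch.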

    ELMo embeddings code usage example:

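    import tensorflow_hub as hub
    import tensorflow as tf

    # hub.Module is the TF1-style Hub API, so this assumes TensorFlow 1.x.
    # trainable=True exposes the module's trainable variables for fine-tuning.
    elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)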