Tags: python, tensorflow, keras, embedding

How to make an array as a word embedding, similar to tf.keras.datasets.imdb.get_word_index?


I'm new to machine learning. I saw code for binary classification of movie reviews from IMDB. I was trying to use the same code with my own dataset (where the columns are "text": this is my emotional sentence, "labels": 0 or 1).

I want to make a word embedding called word_index, similar to tf.keras.datasets.imdb.get_word_index

{'fawn': 34701, 'tsukino': 52006, 'nunnery': 52007, 'sonja': 16816, 'vani': 63951, 'woods': 1408, ...}

What I tried is this, but I'm not sure if it gives the same kind of result as get_word_index:

{k: v for k, v in enumerate(my_dataset)}

Solution

  • I think you have mixed up the terms word embedding and word_index. Word embeddings are vector representations of words in a language; there are many ways to obtain them (e.g., pre-trained embeddings such as Word2Vec, GloVe, or BERT). They can be used instead of one-hot encodings for words.

    Word_index, on the other hand, is a vocabulary built from the input text collection based on word frequencies. tf.keras.datasets.imdb.get_word_index returns the word_index for the IMDB dataset. To get a word_index for your own dataset, fit a keras.preprocessing.text.Tokenizer on your texts with fit_on_texts and then read its word_index attribute. It is also nicely explained in this previous post.
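As a minimal sketch of the Tokenizer approach (the sample sentences and the `texts` variable are made up for illustration; in your case `texts` would be the "text" column of your dataset):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Stand-in for your dataset's "text" column
texts = ["this is my emotional sentence", "another emotional sentence here"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)      # builds the vocabulary from word frequencies
word_index = tokenizer.word_index  # dict mapping word -> integer index

print(word_index)
```

Note that, like get_word_index for IMDB, the indices are assigned by frequency (more frequent words get smaller indices) and start at 1, so index 0 stays free for padding.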