Tags: python, tensorflow, word2vec, fasttext

Use of fasttext Pre-trained word vector as embedding in tensorflow script


Can I use fastText word vectors, like the ones available here: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md, in a TensorFlow script as the embedding vectors instead of word2vec or GloVe, without using the fasttext library?


Solution

  • To load pre-trained word vectors, you can use the gensim library.

    For reference, see: https://blog.manash.me/how-to-use-pre-trained-word-vectors-from-facebooks-fasttext-a71e6d55f27

    In [1]: from gensim.models import KeyedVectors
    
    In [2]: jp_model = KeyedVectors.load_word2vec_format('wiki.ja.vec')
    
    In [3]: jp_model.most_similar('car')
    Out[3]: 
    [('cab', 0.9970724582672119),
     ('tle', 0.9969051480293274),
     ('oyc', 0.99671471118927),
     ('oyt', 0.996662974357605),
     ('車', 0.99665766954422),
     ('s', 0.9966464638710022),
     ('新車', 0.9966358542442322),
     ('hice', 0.9966053366661072),
     ('otg', 0.9965877532958984),
     ('車両', 0.9965814352035522)]
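
    If loading the full .vec file is slow or memory-heavy, one option is gensim's limit argument, which reads only the first N vectors. This is my own suggestion, not part of the linked post, and it assumes the .vec file lists words roughly in descending frequency order (which the fastText Wikipedia vectors do).

    from gensim.models import KeyedVectors

    # Load only the 200,000 most frequent vectors to cut load time and memory.
    en_model = KeyedVectors.load_word2vec_format('wiki.en.vec', limit=200000)
    print(en_model.most_similar('car', topn=3))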
    

    EDIT

    I created a new branch in a fork of cnn-text-classification-tf. Here is the link: https://github.com/satojkovic/cnn-text-classification-tf/tree/use_fasttext

    In this branch, three modifications are needed to use fastText.

    1. Extract the vocab and the word_vec from fastText. (util_fasttext.py)

    import pickle
    import numpy as np
    from gensim.models import KeyedVectors

    # Load the pre-trained English vectors, then dump the vocabulary and the
    # embedding matrix to disk for later use.
    model = KeyedVectors.load_word2vec_format('wiki.en.vec')
    vocab = model.vocab
    embeddings = np.array([model.word_vec(k) for k in vocab.keys()])

    with open('fasttext_vocab_en.dat', 'wb') as fw:
        pickle.dump(vocab, fw, protocol=pickle.HIGHEST_PROTOCOL)
    np.save('fasttext_embedding_en.npy', embeddings)
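
    As a quick sanity check (my own addition, not part of the branch), you can reload the two files and confirm that the embedding matrix has one row per vocabulary entry:

    import pickle
    import numpy as np

    with open('fasttext_vocab_en.dat', 'rb') as fr:
        vocab = pickle.load(fr)
    embeddings = np.load('fasttext_embedding_en.npy')

    # wiki.en.vec provides 300-dimensional vectors.
    assert embeddings.shape == (len(vocab), 300)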
    
    2. Embedding layer

      W is initialized with zeros, then an embedding_placeholder is set up to receive the pre-trained word vectors, and finally W is assigned from it. (text_cnn.py)

    # Non-trainable variable that will hold the pre-trained embeddings.
    W_ = tf.Variable(
        tf.constant(0.0, shape=[vocab_size, embedding_size]),
        trainable=False,
        name='W')

    # Placeholder through which the fastText vectors are fed in.
    self.embedding_placeholder = tf.placeholder(
        tf.float32, [vocab_size, embedding_size],
        name='pre_trained')

    # Assign the pre-trained vectors to W.
    W = tf.assign(W_, self.embedding_placeholder)
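
    For context, W is then used as the lookup table for the input token ids, the same way as in the original cnn-text-classification-tf model; a minimal sketch, where self.input_x, embedded_chars and embedded_chars_expanded follow the naming of that repo:

    # Look up the pre-trained vector for each token id in the batch and add a
    # channel dimension for the convolution layers.
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)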
    
    3. Use the vocab and the word_vec

      The vocab is used to build the word-to-id map, and the word vectors are fed into the embedding_placeholder.

    # Restore the fastText vocabulary and the embedding matrix saved above.
    with open('fasttext_vocab_en.dat', 'rb') as fr:
        vocab = pickle.load(fr)
    embedding = np.load('fasttext_embedding_en.npy')

    # Build the word-to-id map from the fastText vocabulary, then map the
    # input texts to id sequences.
    pretrain = vocab_processor.fit(vocab.keys())
    x = np.array(list(vocab_processor.transform(x_text)))

    # The embedding matrix is fed through the placeholder at training time.
    feed_dict = {
        cnn.input_x: x_batch,
        cnn.input_y: y_batch,
        cnn.dropout_keep_prob: FLAGS.dropout_keep_prob,
        cnn.embedding_placeholder: embedding
    }
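
    The feed_dict is then passed to sess.run together with the training op, as in the original train.py; a rough sketch, assuming the usual train_op, global_step and cnn.loss names from that repo:

    # One training step: the pre-trained embeddings are fed to the placeholder
    # (and assigned to W), and the CNN weights are updated on the batch.
    _, step, loss = sess.run(
        [train_op, global_step, cnn.loss],
        feed_dict=feed_dict)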
    

    Please try it out.