r nlp word2vec text2vec

Use a pre-trained model with text2vec?


I would like to use a pre-trained model with text2vec. My understanding is that the benefit is that these models have already been trained on a huge volume of data, e.g. the Google News model.

Reading the text2vec documentation, it looks like the getting-started code reads in text data and then trains a model with it:

library(text2vec)
text8_file = "~/text8"
if (!file.exists(text8_file)) {
  download.file("http://mattmahoney.net/dc/text8.zip", "~/text8.zip")
  unzip("~/text8.zip", files = "text8", exdir = "~/")
}
wiki = readLines(text8_file, n = 1, warn = FALSE)

The documentation then shows how to create tokens and a vocabulary:

# Create iterator over tokens
tokens <- space_tokenizer(wiki)
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

Then, this looks like the step to fit the model:

glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
glove$fit(tcm, n_iter = 20)

My question is: is the well-known Google pre-trained word2vec model usable here, without the need to rely on my own vocabulary or my own local device to train the model? If yes, how could I read it in and use it in R?

I think I'm misunderstanding or missing something here. Can I use text2vec for this task?


Solution

  • At the moment text2vec doesn't provide any functionality for downloading or manipulating pre-trained word embeddings. I have drafts to add such utilities in the next release.

    On the other hand, you can easily do it manually with just standard R tools. For example, here is how to read fastText vectors:

    con = url("https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.af.300.vec.gz", "r")
    con = gzcon(con)
    wv = readLines(con, n = 10)
    

    Then you just need to parse it: strsplit and rbind are your friends.
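
The parsing step could look roughly like the sketch below. It assumes the fastText `.vec` text layout, where the first line is a header holding the word count and the vector dimension, and every following line is a word followed by its space-separated components. The sample lines and their values are made up for illustration and stand in for the output of `readLines(con)`; the names `header`, `parts`, `words` and `vectors` are just illustrative.

```r
# Stand-in for the lines returned by readLines(con).
# Assumed layout: header "<n_words> <dim>", then "<word> <v1> ... <vdim>".
# The words and numbers below are invented for this example.
wv <- c(
  "3 4",
  "die 0.1 0.2 0.3 0.4",
  "van -0.5 0.0 0.25 1.0",
  "en 0.05 -0.15 0.35 0.45"
)

# The header gives the expected number of rows and columns.
header <- as.integer(strsplit(wv[1], " ", fixed = TRUE)[[1]])

# Split every remaining line on single spaces.
parts <- strsplit(wv[-1], " ", fixed = TRUE)

# The first field is the word, the remaining fields are the vector.
words <- vapply(parts, `[[`, character(1), 1L)
vectors <- do.call(rbind, lapply(parts, function(p) as.numeric(p[-1])))
rownames(vectors) <- words

# Sanity check against the header.
stopifnot(nrow(vectors) == header[1], ncol(vectors) == header[2])

# Look up the embedding of a single word by row name.
vectors["van", ]
```

The result is an ordinary numeric matrix with one row per word, so base R matrix operations apply to it directly. For a full-size file, reading everything with `readLines` and splitting in memory may be slow, so this is only meant to show the idea.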