Tags: r, word2vec, text2vec, glove

GloVe word embedding model parameters using text2vec in R, and displaying training output (epochs) every n iterations


I am using the text2vec package in R to train word embeddings (GloVe model) as follows:

library(text2vec)
library(tm)
library(magrittr)  # provides %>% in case it is not already attached

# Lowercase and tokenize the documents
prep_fun = tolower
tok_fun = word_tokenizer
tokens = docs %>%  # docs: a collection of text documents
  prep_fun %>%
  tok_fun

it = itoken(tokens, progressbar = FALSE)

# Build the vocabulary without SMART stop words
stopword <- tm::stopwords("SMART")
vocab = create_vocabulary(it, stopwords = stopword)

vectorizer <- vocab_vectorizer(vocab)

# Term co-occurrence matrix with a window of 6 tokens
tcm <- create_tcm(it, vectorizer, skip_grams_window = 6)

# x_max: length(vocab$doc_count) / 100, clamped to the range [10, 50]
x_max <- min(50, max(10, ceiling(length(vocab$doc_count) / 100)))
glove_model <- GlobalVectors$new(word_vectors_size = 200, vocabulary = vocab, x_max = x_max, learning_rate = 0.1)

word_vectors <- glove_model$fit_transform(tcm, n_iter = 1000, convergence_tol = 0.001)

When I run this code, I get the following output: [screenshot not included]

My questions are:

  1. Is it possible to get output only every n iterations, e.g. for epochs 50, 100, 150 and so on?
  2. Are there any suggestions for good values of word_vectors_size, x_max and learning_rate? For example, what values work best for 10,000 documents?

I appreciate your response.

Many thanks, Sam


Solution

  • The GlobalVectors class has a member called n_dump_every. Set it to some number n and the history of word embeddings will be saved every n iterations. The history can then be retrieved with the get_history() method.

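    # Create the model as in the question (this answer uses x_max = 100)
    glove_model <- GlobalVectors$new(word_vectors_size = 200, vocabulary = vocab, x_max = 100, learning_rate = 0.1)
    # Save the word embeddings every 10 iterations
    glove_model$n_dump_every = 10
    word_vectors <- glove_model$fit_transform(tcm, n_iter = 1000, convergence_tol = 0.001)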
    trace = glove_model$get_history()
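
If per-epoch cost values are available as a plain numeric vector, reporting only every n-th epoch (50, 100, 150, ...) comes down to indexing. The sketch below is base R only. `cost_history` and `report_every` are hypothetical names, and the values are made up for illustration; they are not output from text2vec, and the structure returned by get_history() is not assumed here.

```r
# Hypothetical per-epoch cost values (made up for illustration,
# not real GloVe training output).
cost_history <- 1 / seq_len(200)

# Return the cost at every n-th epoch (n, 2n, 3n, ...).
# Yields a zero-row data frame if fewer than n epochs were run.
report_every <- function(cost, n) {
  epochs <- seq_len(length(cost) %/% n) * n
  data.frame(epoch = epochs, cost = cost[epochs])
}

report <- report_every(cost_history, 50)
print(report)
```

The same indexing could be applied to whatever per-epoch values the training run exposes, once their layout is known.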
    

    Regarding the second question:

    • You may try varying the learning rate a bit (usually decreasing it), but the default should be fine. Keep track of the value of the cost function while you do.
    • The more data you have, the larger the value you can use for word_vectors_size. For a Wikipedia-sized corpus, 300 is usually enough. For smaller datasets you may start with 20-50. You really need to experiment with this.
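
The answer does not cover x_max. As background, in the original GloVe formulation x_max is the cutoff of the weighting function f(x) = (x / x_max)^alpha for x < x_max and 1 otherwise, so every co-occurrence count at or above x_max gets full weight. The following is a minimal base-R sketch, assuming alpha = 0.75 (the value used in the GloVe paper). `glove_weight` is a hypothetical helper written for illustration, not a text2vec function.

```r
# GloVe weighting function: down-weights rare co-occurrences and
# caps the weight at 1 for counts >= x_max.
# alpha = 0.75 is an assumption taken from the GloVe paper.
glove_weight <- function(x, x_max, alpha = 0.75) {
  ifelse(x < x_max, (x / x_max)^alpha, 1)
}

# Illustrative co-occurrence counts (made up)
counts <- c(1, 5, 10, 50, 100)

w_small <- glove_weight(counts, x_max = 10)
w_large <- glove_weight(counts, x_max = 100)
print(rbind(counts, w_small, w_large))
```

With x_max = 10, the counts 10, 50 and 100 all receive weight 1, while with x_max = 100 only the count of 100 does. Lowering x_max therefore gives full weight to a larger share of the co-occurrence pairs.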