
Preparing word embeddings in text2vec R package


The text2vec package's vignette provides an example of creating word embeddings. The wiki data is tokenized, then a term co-occurrence matrix (TCM) is built and used to fit the word embeddings with the glove function provided in the package. I want to build word embeddings for the movie review data that ships with the package. My question is:

  1. Do I need to collapse all the movie reviews into one long string and then tokenize it?

That would make the tokens on either side of a boundary between two reviews co-occur, which does not make sense.

**Vignette code:**
library(text2vec)
library(readr)
temp <- tempfile()
download.file('http://mattmahoney.net/dc/text8.zip', temp)
wiki <- read_lines(unz(temp, "text8"))
unlink(temp)
# Create iterator over tokens
tokens <- strsplit(wiki, split = " ", fixed = TRUE)
# Create vocabulary. Terms will be unigrams (simple words).
vocab <- create_vocabulary(itoken(tokens))
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
# Create a fresh iterator to pass to create_tcm
it <- itoken(tokens)
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab, 
                               # don't build a document-term matrix
                               grow_dtm = FALSE, 
                               # use window of 5 for context words
                               skip_grams_window = 5L)
tcm <- create_tcm(it, vectorizer)
fit <- glove(tcm = tcm,
             word_vectors_size = 50,
             x_max = 10, learning_rate = 0.2,
             num_iters = 15)

The data I am interested in building word embeddings for can be loaded as follows:

library(text2vec)
data("movie_review")

Solution

  • No, you do not need to concatenate the reviews. You just need to construct the TCM from the right iterator over tokens, where each review is its own token vector:

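    library(text2vec)
    library(magrittr)  # provides %>% in case it is not already available
    data("movie_review")
    # one character vector of tokens per review
    tokens = movie_review$review %>% tolower %>% word_tokenizer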
    it = itoken(tokens)
    # create vocabulary
    v = create_vocabulary(it) %>% 
      prune_vocabulary(term_count_min = 5)
    # create co-occurrence vectorizer
    vectorizer = vocab_vectorizer(v, grow_dtm = FALSE, skip_grams_window = 5)
    

    Now we need to reinitialise the iterator (this is required in the stable 0.3 version; in the development 0.4 version it is not needed):

    it = itoken(tokens)
    tcm = create_tcm(it, vectorizer)
    

    Fit model:

    fit <- glove(tcm = tcm,
                 word_vectors_size = 50,
                 x_max = 10, learning_rate = 0.2,
                 num_iters = 15)
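
To see why keeping each review as its own token vector avoids the boundary problem from the question, here is a minimal base-R sketch. It is not text2vec's implementation: `count_cooccurrence` is a hypothetical helper that only counts unordered word pairs inside a window, without the distance weighting a real TCM would use, and the two toy "reviews" are made up for illustration.

```r
# Hypothetical helper: count unordered word pairs that fall within `window`
# tokens of each other. Each element of `docs` is treated as a separate
# document, so pairs are never formed across documents.
count_cooccurrence <- function(docs, window = 5L) {
  counts <- list()
  for (tokens in docs) {
    n <- length(tokens)
    if (n < 2L) next
    for (i in seq_len(n - 1L)) {
      last <- min(n, i + window)
      for (j in (i + 1L):last) {
        # sort so that ("a", "b") and ("b", "a") share one key
        key <- paste(sort(c(tokens[i], tokens[j])), collapse = " ")
        old <- counts[[key]]
        counts[[key]] <- (if (is.null(old)) 0L else old) + 1L
      }
    }
  }
  unlist(counts)
}

# Two toy reviews (made-up data)
reviews <- list(c("great", "movie"), c("terrible", "plot"))

# One token vector per review: pairs stay inside each review
separate <- count_cooccurrence(reviews, window = 5L)

# Everything collapsed into one long token vector: pairs cross the boundary
merged <- count_cooccurrence(list(unlist(reviews)), window = 5L)

# Pairs that exist only because the reviews were concatenated
boundary_pairs <- setdiff(names(merged), names(separate))
print(boundary_pairs)
```

With the list of per-review token vectors, "movie" and "terrible" never co-occur; they only do so once the reviews are concatenated. Passing a list with one token vector per review to `itoken` relies on the same idea.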