Tags: r, nlp, word2vec, word-embedding

Classic king - man + woman = queen example with pretrained word-embedding and word2vec package in R


I am really desperate: I simply cannot reproduce the classic king - man + woman = queen example with the word2vec package in R and any (!) pre-trained embedding model (as a .bin file).

I would be very grateful if anybody could provide working code to reproduce this example, including a link to the necessary pre-trained model that is actually downloadable (many are not!).

Thank you very much!


Solution

  • An overview of using word2vec with R is available at https://www.bnosac.be/index.php/blog/100-word2vec-in-r which even shows an example of king - man + woman = queen.

    Following the instructions there, I took the first English 300-dim word2vec model listed at http://vectors.nlpl.eu/repository that was trained on the British National Corpus, downloaded it, and unzipped the model.bin to my drive. Next, I inspected the terms in the model (the words are apparently stored with POS tags appended), extracted the word vectors, displayed them, computed king - man + woman, and looked up the vectors closest to the result, which gives ... queen.

    > library(word2vec)
    > model <- read.word2vec("C:/Users/jwijf/OneDrive/Bureaublad/model.bin", normalize = TRUE)
    > head(summary(model, type = "vocabulary"), n = 10)
     [1] "vintage-style_ADJ" "Sinopoli_PROPN"    "Yarrell_PROPN"     "en-1_NUM"          "74°–78°F_X"       
     [6] "bursa_NOUN"        "uni-male_ADJ"      "37541_NUM"         "Menuetto_PROPN"    "Saxena_PROPN"     
    > wv <- predict(model, newdata = c("king_NOUN", "man_NOUN", "woman_NOUN"), type = "embedding")
    > head(t(wv), n = 10)
           king_NOUN    man_NOUN  woman_NOUN
     [1,] -0.4536242 -0.47802860 -1.03320265
     [2,]  0.7096733  1.40374041 -0.91597748
     [3,]  1.1509652  2.35536361  1.57869458
     [4,] -0.2882653 -0.59587735 -0.59021348
     [5,] -0.2110678 -1.05059254 -0.64248675
     [6,]  0.1846713 -0.05871651 -1.01818573
     [7,]  0.5493720  0.13456300  0.38765019
     [8,] -0.9401053  0.56237948  0.02383301
     [9,]  0.1140556 -0.38569298 -0.43408644
    [10,]  0.3657919  0.92853492 -2.56553030
    > wv <- wv["king_NOUN", ] - wv["man_NOUN", ] + wv["woman_NOUN", ]
    > predict(model, newdata = wv, type = "nearest", top_n = 4)
                 term similarity rank
    1       king_NOUN  0.9332663    1
    2      queen_NOUN  0.7813236    2
    3 coronation_NOUN  0.7663506    3
    4   kingship_NOUN  0.7626975    4
    
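    The steps above can be wrapped into a small reusable helper. This is a minimal sketch, assuming the same POS-tagged model.bin from the NLPL repository; the `analogy()` function name and its arguments are my own, not part of the word2vec package.

    ```r
    library(word2vec)

    # Read the pre-trained model; adjust the path to where you unzipped model.bin.
    model <- read.word2vec("model.bin", normalize = TRUE)

    # Generalised analogy: a is to b as c is to ?
    # (e.g. man -> king, so woman -> ?)
    analogy <- function(model, a, b, c, top_n = 5) {
      wv <- predict(model, newdata = c(a, b, c), type = "embedding")
      target <- wv[b, ] - wv[a, ] + wv[c, ]
      predict(model, newdata = target, type = "nearest", top_n = top_n)
    }

    analogy(model, a = "man_NOUN", b = "king_NOUN", c = "woman_NOUN")
    ```

    Note that the nearest term is usually the query word itself (king), so queen typically shows up at rank 2; if that bothers you, filter the input terms out of the returned data frame.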

    If you prefer to build your own model, based on your own text or on a larger corpus such as the text8 file, follow the instructions shown at https://www.bnosac.be/index.php/blog/100-word2vec-in-r: get a text file, use the R package word2vec to build the model, wait until training has finished, and then interact with it.

    download.file("http://mattmahoney.net/dc/text8.zip", "text8.zip")
    unzip("text8.zip", files = "text8")
    
    > library(word2vec)
    > set.seed(123456789)
    > model <- word2vec(x = "text8", type = "cbow", dim = 100, window = 10, lr = 0.05, iter = 5, hs = FALSE, threads = 2)
    > wv    <- predict(model, newdata = c("king", "man", "woman"), type = "embedding")
    > wv    <- wv["king", ] - wv["man", ] + wv["woman", ]
    > predict(model, newdata = wv, type = "nearest", top_n = 4)
          term similarity rank
    1     king  0.9743692    1
    2    queen  0.8295941    2
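    Once training has finished, you can also persist the model so you do not have to retrain it every session. A minimal sketch using the package's `write.word2vec()` and `read.word2vec()`; the file name "text8.bin" is arbitrary:

    ```r
    library(word2vec)

    # Train a model on text8 as above.
    model <- word2vec(x = "text8", type = "cbow", dim = 100, window = 10,
                      lr = 0.05, iter = 5, hs = FALSE, threads = 2)

    # Save the embeddings in the binary word2vec format ...
    write.word2vec(model, file = "text8.bin")

    # ... and reload them later without retraining.
    model2 <- read.word2vec("text8.bin", normalize = TRUE)
    wv <- predict(model2, newdata = c("king", "man", "woman"), type = "embedding")
    ```

    A model reloaded this way behaves like any other pre-trained .bin model, so the same `predict(..., type = "nearest")` analogy queries work on it.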