I would like to train the word2vec model on my own corpus using the rword2vec package in R. The word2vec function that is used to train the model requires a train_file. The package's documentation in R simply notes that this is the training text data, but doesn't specify how to create it.
The training data used in the example on GitHub can be downloaded here: http://mattmahoney.net/dc/text8.zip. I can't figure out what type of file it is.
I've looked through the README file on the rword2vec GitHub page and checked out the official word2vec page on Google Code.
My corpus is a .csv file with about 68,000 documents; the file is roughly 300 MB. I realize that training the model on a corpus of this size might take a long time (or be infeasible), but I'm willing to train it on a subset of the corpus. I just don't know how to create the train_file required by the function.
After you unzip text8, you can open it with a text editor. You'll see that it is just one long plain-text document. You will need to decide how many of your 68,000 documents you want to use for training and whether you want to concatenate them together or keep them as separate documents (see https://datascience.stackexchange.com/questions/11077/using-several-documents-with-word2vec). Either way, the train_file is simply a plain-text file of your corpus, which you can write out from R as sketched below.
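Here is a minimal sketch of one way to build a train_file from your CSV, assuming it has a column of document text named text (a hypothetical name; adjust to your data) and that you keep one document per line. The training call mirrors the example in the rword2vec README:

```r
library(rword2vec)

# Read the corpus; assumes the document text lives in a column
# named "text" (adjust to your actual column name)
corpus <- read.csv("my_corpus.csv", stringsAsFactors = FALSE)

# Optionally train on a subset, e.g. the first 10,000 documents
docs <- corpus$text[1:10000]

# Light cleanup, roughly mirroring how text8 was preprocessed:
# lower-case everything and replace non-letters with spaces
docs <- tolower(docs)
docs <- gsub("[^a-z ]", " ", docs)

# Write a plain-text train_file, one document per line
# (use paste(docs, collapse = " ") instead if you want a single
# long document like text8)
writeLines(docs, "train_file.txt")

# Train the model on the file we just wrote
model <- word2vec(train_file = "train_file.txt",
                  output_file = "vec.bin",
                  binary = 1)
```

Whether you collapse everything into one line or keep one document per line depends on your downstream use; the linked Data Science SE question discusses the trade-offs.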