In gensim, when I pass a string as input while training a Doc2Vec model, I get this error:
TypeError('don\'t know how to handle uri %s' % repr(uri))
I referred to this question Doc2vec : TaggedLineDocument() but am still unsure about the input format.
documents = TaggedLineDocument('myfile.txt')
Should myfile.txt contain the tokens as one list of lists, a separate list per line for each document, or plain strings?
For example, I have two documents:
Doc 1 : Machine learning is a subfield of computer science that evolved from the study of pattern recognition.
Doc 2 : Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn".
So, what should myfile.txt look like?
Case 1: the plain text of each document on its own line
Machine learning is a subfield of computer science that evolved from the study of pattern recognition
Arthur Samuel defined machine learning as a Field of study that gives computers the ability to learn
Case 2: a list of lists containing the tokens of each document
[ ["Machine", "learning", "is", "a", "subfield", "of", "computer", "science", "that", "evolved", "from", "the", "study", "of", "pattern", "recognition"]
,
["Arthur", "Samuel", "defined", "machine", "learning", "as", "a", "Field", "of", "study", "that", "gives", "computers" ,"the", "ability", "to", "learn"] ]
Case 3: the list of tokens for each document on a separate line
["Machine", "learning", "is", "a", "subfield", "of", "computer", "science", "that", "evolved", "from", "the", "study", "of", "pattern", "recognition"]
["Arthur", "Samuel", "defined", "machine", "learning", "as", "a", "Field", "of", "study", "that", "gives", "computers" ,"the", "ability", "to", "learn"]
And when I run the model on test data, what format should the sentence whose doc vector I want to infer be in? Should it be like Case 1 or Case 2 below, or something else?
model.infer_vector(testSentence, alpha=start_alpha, steps=infer_epoch)
Should testSentence be:
Case 1: a string
testSentence = "Machine learning is an evolving field"
Case 2: a list of tokens
testSentence = ["Machine", "learning", "is", "an", "evolving", "field"]
TaggedLineDocument is a convenience class that expects its source file (or file-like object) to contain space-delimited tokens, one document per line. (That is, what you refer to as 'Case 1' in your 1st question.)
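For instance, a minimal sketch of writing and then reading such a file (the filename and the lowercased text are just illustrative):

from gensim.models.doc2vec import TaggedLineDocument

# one document per line, tokens separated by spaces
with open('myfile.txt', 'w') as f:
    f.write("machine learning is a subfield of computer science\n")
    f.write("arthur samuel defined machine learning as a field of study\n")

documents = TaggedLineDocument('myfile.txt')
for doc in documents:
    # each item is a TaggedDocument whose words are the line's tokens
    # and whose single tag is the line number: tags=[0], then tags=[1]
    print(doc.words, doc.tags)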
But you can write your own iterable object to feed to gensim Doc2Vec as the documents corpus, as long as this corpus (1) iterates over objects that, like TaggedDocument, have words and tags list attributes; and (2) can be iterated over multiple times, for the multiple passes Doc2Vec requires: the initial vocabulary survey, then iter training passes.
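As a rough sketch of such a corpus (the class name MyCorpus is made up, and the constructor keywords follow current gensim, where older releases spell them size and iter instead of vector_size and epochs):

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

class MyCorpus:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # re-opening the file on each call makes the corpus restartable,
        # so Doc2Vec can do its vocabulary survey plus every training pass
        with open(self.path) as f:
            for item_no, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=[item_no])

model = Doc2Vec(MyCorpus('myfile.txt'), vector_size=50, epochs=10)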
The infer_vector() method takes a list of tokens, similar to the words attribute of individual TaggedDocument-like objects. (That is, what you refer to as 'Case 2' in your 2nd question.)
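For example, assuming model is your already-trained Doc2Vec model:

# tokenize the same way the training data was tokenized
test_tokens = "machine learning is an evolving field".split()
vector = model.infer_vector(test_tokens)

The optional inference parameters from your snippet, alpha and steps, still apply, though newer gensim releases name the pass count epochs instead of steps.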