In gensim, when I pass a string as input while training a Doc2Vec model, I get this error:
TypeError('don\'t know how to handle uri %s' % repr(uri))
I referred to this question Doc2vec : TaggedLineDocument() but am still unsure about the input format.
documents = TaggedLineDocument('myfile.txt')
Should myfile.txt contain the tokens as one list of lists, a separate list per line for each document, or plain strings?
For example, I have two documents:
Doc 1 : Machine learning is a subfield of computer science that evolved from the study of pattern recognition.
Doc 2 : Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn".
So, what should myfile.txt look like?
Case 1: the plain text of each document on its own line
Machine learning is a subfield of computer science that evolved from the study of pattern recognition
Arthur Samuel defined machine learning as a Field of study that gives computers the ability to learn
Case 2: a list of lists containing the tokens of each document
[ ["Machine", "learning", "is", "a", "subfield", "of", "computer", "science", "that", "evolved", "from", "the", "study", "of", "pattern", "recognition"]
,
["Arthur", "Samuel", "defined", "machine", "learning", "as", "a", "Field", "of", "study", "that", "gives", "computers" ,"the", "ability", "to", "learn"] ]
Case 3: the list of tokens for each document on a separate line
["Machine", "learning", "is", "a", "subfield", "of", "computer", "science", "that", "evolved", "from", "the", "study", "of", "pattern", "recognition"]
["Arthur", "Samuel", "defined", "machine", "learning", "as", "a", "Field", "of", "study", "that", "gives", "computers" ,"the", "ability", "to", "learn"]
And when I run the model on test data, what format should the sentence whose doc vector I want to infer be in? Should it be like Case 1 or Case 2 below, or something else?
model.infer_vector(testSentence, alpha=start_alpha, steps=infer_epoch)
Should testSentence be:
Case 1: a string
testSentence = "Machine learning is an evolving field"
Case 2: a list of tokens
testSentence = ["Machine", "learning", "is", "an", "evolving", "field"]
TaggedLineDocument is a convenience class that expects its source file (or file-like object) to contain space-delimited tokens, one document per line. (That is, what you refer to as 'Case 1' in your 1st question.)
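For instance, a minimal sketch of writing and then reading such a file (the filename and the lowercased text are just illustrative):

from gensim.models.doc2vec import TaggedLineDocument

# one document per line, tokens separated by spaces
with open('myfile.txt', 'w') as f:
    f.write("machine learning is a subfield of computer science\n")
    f.write("arthur samuel defined machine learning as a field of study\n")

documents = TaggedLineDocument('myfile.txt')
for doc in documents:
    # each item is a TaggedDocument whose words are the line's tokens
    # and whose single tag is the line number: tags=[0], then tags=[1]
    print(doc.words, doc.tags)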
But you can write your own iterable object to feed to gensim Doc2Vec as the documents corpus, as long as this corpus (1) iterates over objects that, like TaggedDocument, have words and tags list attributes; and (2) can be iterated over multiple times, for the multiple passes Doc2Vec requires: the initial vocabulary survey, then iter training passes.
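As a rough sketch of such a corpus (the class name MyCorpus is made up, and the constructor keywords follow current gensim, where older releases spell them size and iter instead of vector_size and epochs):

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

class MyCorpus:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # re-opening the file on each call makes the corpus restartable,
        # so Doc2Vec can do its vocabulary survey plus every training pass
        with open(self.path) as f:
            for item_no, line in enumerate(f):
                yield TaggedDocument(words=line.split(), tags=[item_no])

model = Doc2Vec(MyCorpus('myfile.txt'), vector_size=50, epochs=10)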
The infer_vector() method takes a list of tokens, similar to the words attribute of individual TaggedDocument-like objects. (That is, what you refer to as 'Case 2' in your 2nd question.)
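For example, assuming model is your already-trained Doc2Vec model:

# tokenize the same way the training data was tokenized
test_tokens = "machine learning is an evolving field".split()
vector = model.infer_vector(test_tokens)

The optional inference parameters from your snippet, alpha and steps, still apply, though newer gensim releases name the pass count epochs instead of steps.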