Search code examples
pythongensimdoc2vec

applying the Similar function in Gensim.Doc2Vec


I am trying to get the doc2vec function to work in python 3. I Have the following code:

tekstdata = [[ index, str(row["StatementOfTargetFiguresAndPoliciesForTheUnderrepresentedGender"])] for index, row in data.iterrows()]
def prep (x):
    low = x.lower()
    return word_tokenize(low)

def cleanMuch(data, clean):
    output = []
    for x, y in data:
        z = clean(y)
        output.append([str(x), z])
    return output

tekstdata = cleanMuch(tekstdata, prep)

def tagdocs(docs):
    output = []    
    for x,y in docs:
        output.append(gensim.models.doc2vec.TaggedDocument(y, x))
    return output
    tekstdata = tagdocs(tekstdata)

    print(tekstdata[100])

vectorModel = gensim.models.doc2vec.Doc2Vec(tekstdata, size = 100, window = 4,min_count = 3, iter = 2)


ranks = []
second_ranks = []
for x, y in tekstdata:
 print (x)
 print (y)
 inferred_vector = vectorModel.infer_vector(y)
 sims = vectorModel.docvecs.most_similar([inferred_vector], topn=1001,   restrict_vocab = None)
rank = [docid for docid, sim in sims].index(y)
ranks.append(rank)

All works as far as I can understand until the rank function. The error I get is that there is no zero in my list e.g. the documents I am putting in does not have 10 in list:

  File "C:/Users/Niels Helsø/Documents/github/Speciale/Test/Data prep.py", line 59, in <module>
rank = [docid for docid, sim in sims].index(y)

ValueError: '10' is not in list

It seems to me that it is the similar function that does not work. the model trains on my data (1000 documents) and build a vocab which is tagged. The documentation I mainly have used is this: Gensim dokumentation Torturial

I hope that some one can help. If any additional info is need please let me know. best Niels


Solution

  • If you're getting ValueError: '10' is not in list, you can rely on the fact that '10' is not in the list. So have you looked at the list, to see what is there, and if it matches what you expect?

    It's not clear from your code excerpts that tagdocs() is ever called, and thus unclear what form tekstdata is in when provided to Doc2Vec. The intent is a bit convoluted, and there's nothing to display what the data appears as in its raw, original form.

    But perhaps the tags you are supplying to TaggedDocument are not the required list-of-tags, but rather a simple string, which will be interpreted as a list-of-characters. As a result, even if you're supplying a tags of '10', it will be seen as ['1', '0'] – and len(vectorModel.doctags) will be just 10 (for the 10 single-digit strings).

    Separate comments on your setup:

    • 1000 documents is pretty small for Doc2Vec, where most published results use tens-of-thousands to millions of documents
    • an iter of 10-20 is more common in Doc2Vec work (and even larger values might be helpful with smaller datasets)
    • infer_vector() often works better with non-default values in its optional parameters, especially a steps that's much larger (20-200) or a starting alpha that's more like the bulk-training default (0.025)