
Using Tagged Document and Loops in Gensim


I’m trying to compute document similarity values for a corpus of approximately 5,000 legal briefs with Doc2Vec (I recognize that the corpus may be a little small, but this is a proof of concept for a larger corpus of approximately 15,000 briefs I’ll have to compile later). Being somewhat new to Python, I initially had some trouble writing a preprocessing function for the 5,000 text files I have assembled in a folder, but I’ve managed to create one.

The trouble is that when I use TaggedDocument to pair each document's text (the “words”) with a “tag”, the text of only one of the 5,000 documents (.txt files) ends up in the “words” portion, repeated over and over, while the tag (the filename) is different for each entry. Basically, one brief is getting tagged 5,000 times, each time with a different tag, when I obviously want 5,000 briefs, each with its own filename as a unique tag.

Below is the code I used. I’m wondering if anyone can help me figure out where I went wrong with this. I don't know if it's a Tagged Document feature, or if it's a problem with the loop I created - perhaps I need another within it, or there's some issue with the way I have the loop read the filepath? I'm relatively new to Python, so that's completely possible.

Thank you!

briefs = []
BriefList = [p for p in os.listdir(FILEPATH) if p.endswith('.txt')]
for brief in BriefList:
     str = open(FILEPATH + brief,'r').read()
     tokens = re.findall(r"[\w']+|[.,!?;]", str)
     tagged_data = [TaggedDocument(tokens, [brief]) for brief in BriefList]
     briefs.append(tagged_data)

Solution

  • At the end of your code, is len(briefs) what you expect it to be? Does looking at items like briefs[0] or briefs[-1] show the individual TaggedDocument items you expect?

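    You probably don't want two nested for … in loops: the outer one goes over all briefs to open the files, and the inner one (the list comprehension) goes over all briefs again for each file, pairing every filename with that one file's tokens value. As written, each pass also appends a whole list of TaggedDocument items to briefs, so briefs ends up as a list of lists rather than a flat list of documents.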

    Try changing your lines:

         tagged_data = [TaggedDocument(tokens, [brief]) for brief in BriefList]
         briefs.append(tagged_data)
    

    ...to simply construct & append one TaggedDocument at a time...

         briefs.append(TaggedDocument(tokens, [brief]))
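
For reference, here is a minimal sketch of what the whole loop might look like with that change applied. The function name `load_briefs`, the `TOKEN_PATTERN` constant, the UTF-8 encoding, and the sorted file order are my own assumptions, not part of the original code. The fallback namedtuple is only there so the sketch runs where gensim isn't installed (gensim's `TaggedDocument` has the same two fields, `words` and `tags`).

```python
import os
import re
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, used only if gensim is unavailable.
    TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Same tokenizing pattern as in the question: words/apostrophes or punctuation.
TOKEN_PATTERN = re.compile(r"[\w']+|[.,!?;]")


def load_briefs(folder):
    """Return one TaggedDocument per .txt file in `folder`, tagged by filename."""
    briefs = []
    for filename in sorted(os.listdir(folder)):
        if not filename.endswith(".txt"):
            continue
        # os.path.join avoids depending on a trailing slash in the folder path;
        # the encoding is an assumption about how the briefs were saved.
        with open(os.path.join(folder, filename), "r", encoding="utf-8") as f:
            text = f.read()
        tokens = TOKEN_PATTERN.findall(text)
        # One document per file: this file's tokens, this file's name as the tag.
        briefs.append(TaggedDocument(tokens, [filename]))
    return briefs
```

Calling `briefs = load_briefs(FILEPATH)` should then give a flat list. The checks from the top of this answer apply directly: `len(briefs)` should equal the number of .txt files, and `briefs[0]` and `briefs[-1]` should show different words and different tags.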