Tags: python, nlp, artificial-intelligence, word-embedding

All words used to train Doc2Vec model appear as unknown


I'm trying to make an unsupervised text classifier using Doc2Vec. I am getting the error message "The following keywords from the 'keywords_list' are unknown to the Doc2Vec model and therefore not used to train the model: sport". The rest of the error message is very long but eventually ends with "cannot compute similarity with no input", which leads me to believe that none of my keywords are being accepted.
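As a quick sanity check, I can count how often each keyword actually appears in the tokenized documents (a minimal sketch using made-up token lists rather than my real data):

    from collections import Counter

    # Hypothetical token lists; in my script these come from
    # tokenize(row['text_clean']) applied to each document
    tokenized_docs = [
        ["the", "match", "was", "exciting"],
        ["intro", "to", "physics", "lecture", "notes"],
    ]

    counts = Counter(token for doc in tokenized_docs for token in doc)
    for keyword in ["sport", "physics"]:
        print(keyword, counts[keyword])

The keywords do occur in my documents, so I would expect them to be known to the model.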

The portion of code where I create the Lbl2Vec model is

 Lbl2Vec_model = Lbl2Vec(keywords_list=list(labels["keywords"]), tagged_documents=df_pdfs['tagged_docs'], label_names=list(labels["class_name"]), similarity_threshold=0.43, min_num_docs=10, epochs=10)

I've been following this tutorial https://towardsdatascience.com/unsupervised-text-classification-with-lbl2vec-6c5e040354de but using my own dataset that I load from JSON.

The entire code is posted below:

#imports
import tkinter as tk
from tkinter import filedialog
import json

import pandas as pd

import numpy as np

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import strip_tags
from gensim.models.doc2vec import TaggedDocument
from lbl2vec import Lbl2Vec


class DocumentVectorize:
    def __init__(self):
        pass
    
    #imports data from json file
    def import_from_json (self):
        root = tk.Tk()
        root.withdraw()
        json_file_path = filedialog.askopenfile().name
        with open(json_file_path, "r") as json_file:
            try:
                json_load = json.load(json_file)
            except json.JSONDecodeError:
                raise ValueError("No PDFs to convert to JSON")
        self.pdfs = json_load


if __name__ == "__main__":

    #tokenizes documents
    def tokenize(doc):
        return simple_preprocess(strip_tags(doc), deacc=True, min_len=2, max_len=1000000)

    #initializes document vectorization class and imports the data from json
    vectorizer = DocumentVectorize()
    vectorizer.import_from_json()

    #converts json data to dataframe
    df_pdfs = pd.DataFrame.from_dict(vectorizer.pdfs)
    
    #creates data frame that contains the keywords and their class names for the training
    labels =  {"keywords": [["sport"], ["physics"]], "class_name": ["rec.sport", "rec.physics"]}
    labels = pd.DataFrame.from_dict(labels)

    #applies tokenization to documents
    df_pdfs['tagged_docs'] = df_pdfs.apply(lambda row: TaggedDocument(tokenize(row['text_clean']), [str(row.name)]), axis=1)
    #creates key for documents
    df_pdfs['doc_key'] = df_pdfs.index.astype(str)
    print(df_pdfs.head())

    #Initializes Lbl2vec model
    Lbl2Vec_model = Lbl2Vec(keywords_list=list(labels["keywords"]), tagged_documents=df_pdfs['tagged_docs'], label_names=list(labels["class_name"]), similarity_threshold=0.43, min_num_docs=10, epochs=10)

    #Fits Lbl2vec model to data
    Lbl2Vec_model.fit()

Solution

  • The problem was that the default initialization of the Lbl2Vec model uses min_count=50 for words. This means a word must occur at least 50 times in the corpus to be included in the word vectorization set. No word in any of my documents reached that count, so all of my keywords were rejected, which caused the error message I was receiving. This is what the updated code for that line looks like:

     Lbl2Vec_model = Lbl2Vec(keywords_list=list(labels["keywords"]), tagged_documents=df_pdfs['tagged_docs'], label_names=list(labels["class_name"]), min_count=2, similarity_threshold=0.43, min_num_docs=10, epochs=10)
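The same pruning behaviour can be reproduced with a plain gensim Doc2Vec model (a minimal sketch with a toy corpus, not the original data): with a high min_count the rare keyword is dropped from the vocabulary, and with a lower one it is kept.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus: "match" occurs 10 times, "sport" only 3 times
    docs = [
        TaggedDocument(["match", "sport"] if i < 3 else ["match"], [str(i)])
        for i in range(10)
    ]

    # A high min_count prunes rare words, so "sport" becomes unknown to the model
    strict = Doc2Vec(vector_size=50, min_count=5, epochs=10)
    strict.build_vocab(docs)
    print("sport" in strict.wv.key_to_index)   # False
    print("match" in strict.wv.key_to_index)   # True

    # Lowering min_count keeps the rare keyword in the vocabulary
    lenient = Doc2Vec(vector_size=50, min_count=2, epochs=10)
    lenient.build_vocab(docs)
    print("sport" in lenient.wv.key_to_index)  # True

(The wv.key_to_index lookup assumes gensim 4.x; older gensim versions expose the vocabulary differently.)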