I'm trying to make an unsupervised text classifier with Lbl2Vec (which trains a Doc2Vec model under the hood). I'm getting the error message "The following keywords from the 'keywords_list' are unknown to the Doc2Vec model and therefore not used to train the model: sport". The rest of the error message is very long but eventually ends with "cannot compute similarity with no input", which leads me to believe that none of my keywords are being accepted.
The portion of code where I create the Lbl2Vec model is:
Lbl2Vec_model = Lbl2Vec(keywords_list=list(labels["keywords"]), tagged_documents=df_pdfs['tagged_docs'], label_names=list(labels["class_name"]), similarity_threshold=0.43, min_num_docs=10, epochs=10)
I've been following this tutorial https://towardsdatascience.com/unsupervised-text-classification-with-lbl2vec-6c5e040354de but using my own dataset that I load from JSON.
The entire code is posted below:
#imports
from ast import keyword
import tkinter as tk
from tkinter import filedialog
import json
import pandas as pd
import numpy as np
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import strip_tags
from gensim.models.doc2vec import TaggedDocument
from lbl2vec import Lbl2Vec

class DocumentVectorize:
    def __init__(self):
        pass

    #imports data from json file
    def import_from_json(self):
        root = tk.Tk()
        root.withdraw()
        json_file_path = filedialog.askopenfile().name
        with open(json_file_path, "r") as json_file:
            try:
                json_load = json.load(json_file)
            except:
                raise ValueError("No PDFs to convert to JSON")
        self.pdfs = json_load

if __name__ == "__main__":
    #tokenizes documents
    def tokenize(doc):
        return simple_preprocess(strip_tags(doc), deacc=True, min_len=2, max_len=1000000)

    #initializes document vectorization class and imports the data from json
    vectorizer = DocumentVectorize()
    vectorizer.import_from_json()

    #converts json data to dataframe
    df_pdfs = pd.DataFrame.from_dict(vectorizer.pdfs)

    #creates data frame that contains the keywords and their class names for the training
    labels = {"keywords": [["sport"], ["physics"]], "class_name": ["rec.sport", "rec.physics"]}
    labels = pd.DataFrame.from_dict(labels)

    #applies tokenization to documents
    df_pdfs['tagged_docs'] = df_pdfs.apply(lambda row: TaggedDocument(tokenize(row['text_clean']), [str(row.name)]), axis=1)

    #creates key for documents
    df_pdfs['doc_key'] = df_pdfs.index.astype(str)
    print(df_pdfs.head())

    #Initializes Lbl2vec model
    Lbl2Vec_model = Lbl2Vec(keywords_list=list(labels["keywords"]), tagged_documents=df_pdfs['tagged_docs'], label_names=list(labels["class_name"]), similarity_threshold=0.43, min_num_docs=10, epochs=10)

    #Fits Lbl2vec model to data
    Lbl2Vec_model.fit()
The problem was that the default initialization of the Lbl2Vec model uses min_count=50, meaning a word has to occur at least 50 times across the documents before it is included in the word vectorization vocabulary. No word in any of my documents met that threshold, so all of my keywords were rejected, and that was causing the error message I was receiving. This is what the updated code for that line looks like:
Lbl2Vec_model = Lbl2Vec(keywords_list=list(labels["keywords"]), tagged_documents=df_pdfs['tagged_docs'], label_names=list(labels["class_name"]), min_count = 2, similarity_threshold=0.43, min_num_docs=10, epochs=10)
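As a quick sanity check before fitting, you can count how often each keyword actually appears in the tokenized documents and compare that against whatever min_count you plan to pass. This is just a rough sketch that reuses the df_pdfs['tagged_docs'] and labels objects built above; the threshold comparison mimics the vocabulary pruning Doc2Vec does internally and is not part of the Lbl2Vec API.

from collections import Counter

#counts every token across the tagged documents built earlier
token_counts = Counter()
for tagged_doc in df_pdfs['tagged_docs']:
    token_counts.update(tagged_doc.words)

#compares each keyword's frequency against the min_count you plan to use
planned_min_count = 2
for keyword_group in labels["keywords"]:
    for kw in keyword_group:
        count = token_counts.get(kw, 0)
        status = "kept" if count >= planned_min_count else "dropped from the vocabulary"
        print(f"{kw}: {count} occurrences -> {status}")

If every keyword reports a count below the default of 50, that matches the "unknown to the Doc2Vec model" warning, and lowering min_count (or adding more documents) is the fix.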