I am trying to apply Doc2Vec to a dataframe whose first column holds the texts and whose second column holds the label (author). I found this article, https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4, which is really helpful. However, I am stuck at the step that builds the model:
import multiprocessing
from gensim.models import Doc2Vec
import tqdm
cores = multiprocessing.cpu_count()
model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample=0, workers=cores)
model_dbow.build_vocab([x for x in tqdm(train_tagged.values)])
TypeError: 'module' object is not callable
Could you please help me overcome this issue?
Before that, I also have this code:
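from sklearn.model_selection import train_test_split
from gensim.models.doc2vec import TaggedDocument
import nltk
from nltk.corpus import stopwords
train, test = train_test_split(df, test_size=0.3, random_state=42)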
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens
train_tagged = train.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['text']), tags=[r.author]), axis=1)
test_tagged = test.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['text']), tags=[r.author]), axis=1)
Edit: If I remove tqdm, the code works, but I am not sure whether this is acceptable. As far as I know, tqdm is a Python package that lets you create progress bars and estimate TTC (Time To Completion) for your functions and loops, so removing it should not change the output. Right?
Edit2: To improve the code of the tutorial, see also the question "My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?". Thanks again @gojomo
You are importing the tqdm module and not the actual class.
Replace "import tqdm" with "from tqdm import tqdm".
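To see why the error appears, the same mistake can be reproduced with the standard library's datetime module, which, like tqdm, contains a class with the same name as the module. This is only a minimal sketch: datetime is a stand-in chosen so the example runs without installing anything, and the variable names module_error and stamp are made up for illustration.

```python
# `import datetime` binds the name to the MODULE, just like `import tqdm`.
import datetime

try:
    # Calling a module is not allowed, so this raises TypeError.
    datetime(2020, 1, 1)
except TypeError as err:
    module_error = str(err)

print(module_error)  # the message contains "'module' object is not callable"

# `from datetime import datetime` rebinds the name to the CLASS inside the
# module, just like `from tqdm import tqdm`.
from datetime import datetime

# The identical call now works, because the name refers to something callable.
stamp = datetime(2020, 1, 1)
print(stamp.year)
```

With tqdm, keeping "import tqdm" and calling tqdm.tqdm(train_tagged.values) should work as well. As for removing it altogether: tqdm only wraps the iterable and displays progress while yielding the same items, so dropping it should not change what build_vocab receives.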