from gensim.models import Word2Vec
import pandas as pd
import time
# Skip-gram model (sg = 1)
size = 1000
window = 3
min_count = 1
workers = 3
sg = 1
word2vec_model_file = 'word2vec_' + str(size) + '.model'
start_time = time.time()
stemmed_tokens = pd.Series(df['STEMMED_TOKENS']).values
# Train the Word2Vec Model
w2v_model = Word2Vec(stemmed_tokens, min_count = min_count, size = size, workers = workers, window = window, sg = sg)
print("Time taken to train word2vec model: " + str(time.time() - start_time))
w2v_model.save(word2vec_model_file)
This is the code I have written. I applied this file to all ML algorithms for binary classification, but every algorithm gives the same result, 0.48. How is that possible? And this result is also very poor compared to my BERT and TF-IDF scores.
A vector size of 1000 dimensions is very uncommon, and would require massive amounts of data to train. For example, the famous GoogleNews vectors cover 3 million words, trained on something like 100 billion corpus words - and are still only 300 dimensions. Your STEMMED_TOKENS may not be enough data to justify 100-dimensional vectors, much less 300 or 1000.
A choice of min_count=1 is a bad idea. This algorithm can't learn anything valuable from words that only appear a few times. Typically people get better results by discarding rare words entirely, as the default min_count=5 will do. (If you have a lot of data, you're likely to increase this value to discard even more words.)
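For illustration only, here is a sketch of a more conservative training setup. The exact values (100 dimensions, min_count=5) are assumptions you would tune to your corpus, and note that gensim 4.x renamed the size parameter to vector_size:

from gensim.models import Word2Vec

# Assumed example: stemmed_tokens is a list of token lists, one list per text.
# 100 dimensions and the default min_count=5 are far more typical starting
# points for a modest corpus than size=1000 / min_count=1.
w2v_model = Word2Vec(
    sentences=stemmed_tokens,
    vector_size=100,   # called "size" in gensim 3.x
    window=3,
    min_count=5,
    workers=3,
    sg=1,
)
print(len(w2v_model.wv))  # vocabulary size that survived min_count filtering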
Are you examining the model's size or word-to-word results at all to ensure it's doing what you expect? Despite your column being named STEMMED_TOKENS, I don't see any actual splitting-into-tokens, and the Word2Vec class expects each text to be a list-of-strings, not a string.
Finally, without seeing all your other choices for feeding word-vector-enriched data to your other classification steps, it is possible (likely even) that there are other errors there.
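Without seeing that code I can only guess, but a common (and easy to get wrong) step is averaging each text's word vectors into one fixed-length feature vector before classification. A minimal sketch, assuming texts is the list of token lists from above and that 'LABEL' is a placeholder for your actual target column:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def doc_vector(tokens, model):
    # Average the vectors of tokens that survived min_count filtering;
    # fall back to a zero vector when none of the tokens are in the vocabulary.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(tokens, w2v_model) for tokens in texts])
y = df['LABEL'].values  # placeholder name for your binary target column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))

If many texts fall back to the zero vector (because their words were all rare or never tokenized properly), every classifier will see near-identical features and produce near-identical, near-chance scores, which would match what you are observing.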
Given that a binary-classification model can always get at least 50% accuracy by simply classifying every example with whichever class is more common, any accuracy result below 50% should immediately raise suspicion of major problems in your process, such as: