Tags: python, logistic-regression, random-forest, word2vec, naive-bayes

I applied Word2Vec features to several ML algorithms for binary classification. Naive Bayes throws a negative-value error, and all the other algorithms give 0.48 accuracy. How come?


from gensim.models import Word2Vec
import pandas as pd  # needed for pd.Series below
import time

# Skip-gram model (sg = 1)
size = 1000
window = 3
min_count = 1
workers = 3
sg = 1

word2vec_model_file = 'word2vec_' + str(size) + '.model'
start_time = time.time()
stemmed_tokens = pd.Series(df['STEMMED_TOKENS']).values
# Train the Word2Vec model
w2v_model = Word2Vec(stemmed_tokens, min_count=min_count, size=size, workers=workers, window=window, sg=sg)
print("Time taken to train word2vec model: " + str(time.time() - start_time))
w2v_model.save(word2vec_model_file)

This is the code I have written. I fed these vectors to all the ML algorithms for binary classification, but every algorithm gives the same result, 0.48. How is that possible? The result is also much worse than my BERT and TF-IDF scores.


Solution

  • A vector size of 1000 dimensions is very uncommon, and would require massive amounts of data to train. For example, the famous GoogleNews vectors covered 3 million words, trained on something like 100 billion corpus words, yet were still only 300 dimensions. Your STEMMED_TOKENS may not be enough data to justify 100-dimensional vectors, much less 300 or 1000.

  • A choice of min_count=1 is a bad idea. This algorithm can't learn anything valuable from words that only appear a few times. Typically people get better results by discarding rare words entirely, as the default min_count=5 will do. (If you have a lot of data, you're likely to increase this value to discard even more words.)

  • Are you examining the model's size or word-to-word results at all to ensure it's doing what you expect? Despite your column being named STEMMED_TOKENS, I don't see any actual splitting-into-tokens, and the Word2Vec class expects each text to be a list-of-strings, not a string. A sketch of both fixes follows just below.
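
    A minimal sketch of those fixes, assuming df['STEMMED_TOKENS'] holds whitespace-joined stemmed words (adjust the split to your real format) and gensim 4.x, where the size parameter became vector_size:

    from gensim.models import Word2Vec

    # Word2Vec expects each text as a list of token strings, not one string
    tokenized = [str(text).split() for text in df['STEMMED_TOKENS']]

    # Saner starting choices: modest dimensionality, discard rare words
    w2v_model = Word2Vec(
        tokenized,
        vector_size=100,  # named 'size' in gensim < 4.0
        window=3,
        min_count=5,      # the default; drops words seen fewer than 5 times
        workers=3,
        sg=1,
    )

    # Sanity checks: plausible vocabulary size, sensible nearest neighbors
    print("Vocabulary size:", len(w2v_model.wv))
    print(w2v_model.wv.most_similar(w2v_model.wv.index_to_key[0], topn=5))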

  • Finally, without seeing how you feed the word-vector-enriched data to your other classification steps, it is possible (likely even) that there are other errors there; one common pattern is sketched below.
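
    For example, a common approach is to average each document's word vectors into one fixed-length feature vector. A minimal sketch, assuming the tokenized list and w2v_model from the sketch above:

    import numpy as np

    def document_vector(tokens, wv):
        # Average the vectors of in-vocabulary words; zero vector if none remain
        vectors = [wv[token] for token in tokens if token in wv]
        if not vectors:
            return np.zeros(wv.vector_size)
        return np.mean(vectors, axis=0)

    X = np.vstack([document_vector(tokens, w2v_model.wv) for tokens in tokenized])

    # Averaged embeddings contain negative components, which is likely why
    # MultinomialNB raised a negative-value error; GaussianNB (or any
    # classifier that accepts dense real-valued features) works instead.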

  • Given that a binary-classification model can always get at least 50% accuracy by simply classifying every example with whichever class is more common, any accuracy result below 50% should immediately raise suspicion of major problems in your process (a baseline sketch follows the list below), like:

    • misalignment of examples & labels
    • insufficient/unrepresentative training data
    • some steps not running at all due to data-prep or invocation errors
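
    A quick way to make that 50% floor concrete is scikit-learn's DummyClassifier, which always predicts the majority class; any real model should beat it. The label column name here is a placeholder:

    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import train_test_split

    y = df['LABEL'].values  # placeholder label column name
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
    print("Majority-class baseline accuracy:", baseline.score(X_test, y_test))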