python, classification, nltk, sentiment-analysis, naive-bayes

How to use a Naive Bayes classifier after extracting features with TF-IDF


I'm trying to classify the data with a Naive Bayes classifier; I used TF-IDF for feature extraction.

finaltfidfVector is a list of vectors. Each vector is a list of numbers: 0 if the word does not occur, otherwise the word's TF-IDF weight.

classlabels contains the class label for each vector. I'm trying to classify the data with the code below, but it doesn't work.

The dataset has 26652 rows.

import nltk

def naivebyse(finaltfidfVector, classlabels, reviews):

    # first 18697 rows go into the training set
    train_set = []
    j = 0
    for vector in finaltfidfVector:
        arr = {}
        if j < 18697:
            arr[tuple(vector)] = classlabels[j]
            train_set.append((arr, reviews[j]))
            j += 1

    # remaining rows (18697 to 26651) go into the test set
    test_set = []
    j = 18697
    for vector in finaltfidfVector:
        arr = {}
        if 18697 <= j < 26652:
            arr[tuple(vector)] = classlabels[j]
            test_set.append((arr, reviews[j]))
            j += 1

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))

The output:

0.0

The TF-IDF implementation applied to finaltfidfVector follows this reference: https://triton.ml/blog/tf-idf-from-scratch

(A sample of the dataset before preprocessing and TF-IDF was shown here as an image.)

This is a sample of the first vector (index 0) in the finaltfidfVector list:

[0.0, 0.0, 0.0, 0.6214608098422192, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5115995809754083, 0.0, 0.0, 0.0, 0.0, 0.5521460917862246, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6214608098422192, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6214608098422192, 0.0, 0.0, 0.0, 0.6214608098422192]

classlabels contains the class label for each vector: 1 for sarcasm, 0 for not sarcasm. The class label at index 0 is 1; it belongs to the first vector in finaltfidfVector.

The first item of train_set is:

({(0.0, 0.0, 1.3803652294655615, ...): '0'}, "former versace store clerk sues over secret 'black code' for minority shoppers")


Solution

  • Here's a reproducible toy example:

    # let's define a train_set
    train_set = [
        ({'adam': 0.05, 'is': 0.0, 'a': 0.0, 'good': 0.02, 'man': 0.0}, 1),
        ({'eve': 0.0, 'is': 0.0, 'a': 0.0, 'good': 0.02, 'woman': 0.0}, 1),
        ({'adam': 0.05, 'is': 0.0, 'evil': 0.0}, 0)]
    

    The toy data set is created using a handcrafted "tfidf" score dictionary:

    tfidf_dict = {
        'adam': 0.05,
        'eve': 0.05,
        'evil': 0.02,
        'kind': 0.02,
        'good': 0.02,
        'bad': 0.02
    }
    

    Each known word has a TF-IDF score, and an unknown word scores 0. In train_set, the positive sentences ("adam is a good man", "eve is a good woman") are labeled 1, and the negative one ("adam is evil") is labeled 0.
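
    For instance, a featureset like the ones in train_set can be built from a raw sentence with a small helper (featurize and the whitespace tokenization here are illustrative assumptions, not part of the original answer):

    def featurize(sentence):
        # look up each token's tf-idf score; unknown words get 0.0
        return {word: tfidf_dict.get(word, 0.0) for word in sentence.split()}

    featurize("adam is a good man")
    # -> {'adam': 0.05, 'is': 0.0, 'a': 0.0, 'good': 0.02, 'man': 0.0}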

    Now train a classifier:

    import nltk
    clf = nltk.NaiveBayesClassifier.train(train_set)
    

    See how it performs on the toy train set:

    >>> nltk.classify.accuracy(clf, train_set)
    1.0
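
    The trained classifier can also label a new sentence, as long as it is featurized the same way (this reuses the hypothetical featurize helper sketched above):

    # the prediction is whichever label (0 or 1) the classifier finds more probable
    clf.classify(featurize("eve is evil"))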
    

    Since a test set has the same structure as the train set, this is sufficient to show how to train and run a Naive Bayes classifier.
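
    As for why the original code prints 0.0: each of its training pairs is ({tuple(vector): label}, review_text), so the only feature name is the whole vector tuple and the label NLTK learns is the review string itself; the test items share neither with the training data. Below is a minimal sketch of the same data in NLTK's expected (featureset, label) format. It assumes the position of each weight serves as the feature name; vector_to_featureset is a hypothetical helper, while finaltfidfVector, classlabels, and the 18697-row split come from the question.

    import nltk

    def vector_to_featureset(vector):
        # feature name = position in the vector, value = tf-idf weight
        return {i: weight for i, weight in enumerate(vector)}

    labeled = [(vector_to_featureset(v), classlabels[j])
               for j, v in enumerate(finaltfidfVector)]
    train_set, test_set = labeled[:18697], labeled[18697:]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))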