Tags: python-2.7, data-structures, scikit-learn, nltk, naivebayes

Naive Bayes for Text Classification - Python 2.7 Data Structure Issue


I am having an issue training my Naive Bayes classifier. I have a feature set and targets that I want to use, but I keep getting errors. I've looked at other questions about similar problems, but I can't figure out what is wrong. I'm sure there's a simple solution, but I have yet to find it.

Here's an example of the structure of the data that I'm trying to use to train the classifier.

In [1] >> train[0]
Out[1] ({
         u'profici': [False],
         u'saver': [False],
         u'four': [True],
         u'protest': [False],
         u'asian': [True],
         u'upsid': [False],
         .
         .
         .
         u'captain': [False],
         u'payoff': [False],
         u'whose': [False]
         },
         0)

Where train[0] is the first tuple in a list and contains:

  • A dictionary of features and boolean values to indicate the presence or absence of words in document[0]

  • The target label for the binary classification of document[0]

Obviously, the rest of the train list has the features and labels for the other documents that I want to classify.

When running the following code

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

MNB_clf = SklearnClassifier(MultinomialNB())
MNB_clf.train(train)

I get the error message:

  TypeError: float() argument must be a string or a number 
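From the traceback, the failure seems to happen when scikit-learn tries to cast each feature value to a float. That works for a plain boolean but not for a one-element list, which is how my feature values are stored:

    float(False)    # -> 0.0, booleans cast to numbers without complaint
    float([False])  # -> TypeError: float() argument must be a string or a number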

Edit:

The features are created as follows, from a DataFrame post_sent that contains the posts in column 1 and the sentiment classification in column 2.

import itertools

from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

stopwords = set(stopwords.words('english'))
filtered_posts = []
punc_tokenizer = RegexpTokenizer(r'\w+')

# tokenizing and removing stopwords
for post in post_sent.post:
    tokenized = [word.lower() for word in punc_tokenizer.tokenize(post)]
    filtered = [w for w in tokenized if w not in stopwords]
    filtered_posts.append(filtered)

# stemming
# (newer NLTK versions use PorterStemmer().stem(w) instead of .stem_word(w))
tokened_stemmed = []
for post in filtered_posts:
    stemmed = []
    for w in post:
        stemmed.append(PorterStemmer().stem_word(w))
    tokened_stemmed.append(stemmed)

# frequency dist
all_words = list(itertools.chain.from_iterable(tokened_stemmed))
frequency = FreqDist(all_words)

# feature selection
word_features = list(frequency.keys())[:3000]

# IMPORTANT PART
#######################
# ------ featuresets creation ---------
def find_features(post):
    features = {}
    wrds = set(post)
    for w in word_features:
        features[w] = [w in wrds]   # each value ends up as a one-element list
    return features

# zipping inputs with targets
words_and_sent = zip(tokened_stemmed, post_sent.sentiment)

# IMPORTANT PART
##########################
# feature sets created here
featuresets = [(find_features(words), sentiment)
               for words, sentiment in words_and_sent]

Solution

  • You are building train incorrectly. As @lenz said in a comment, remove the brackets around the feature dict values and use single boolean values instead.

    As given in the official documentation:

    labeled_featuresets – A list of (featureset, label) where each featureset is a dict mapping strings to either numbers, booleans or strings.

    But you are setting each mapping (the value of each key in the dict) to a list.

    Your corrected train should look like:

    [({u'profici':False,
       u'saver':False,
       u'four':True,
       u'protest':False,
       u'asian':True,
       u'upsid':False,
       .
       .
      }, 0),
         .. 
         ..
     ({u'profici':True,
       u'saver':False,
       u'four':False,
       u'protest':False,
       u'asian':True,
       u'upsid':False,
       .
       .
      }, 1)]
    

    You can take a look at more examples here: http://www.nltk.org/howto/classify.html
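
    Concretely, the smallest fix in the question's own code is inside find_features: return a plain boolean instead of wrapping it in a list. A sketch of the corrected feature extraction and training, reusing the names from the question:

    def find_features(post):
        features = {}
        wrds = set(post)
        for w in word_features:
            features[w] = w in wrds   # plain boolean instead of [w in wrds]
        return features

    featuresets = [(find_features(words), sentiment)
                   for words, sentiment in zip(tokened_stemmed, post_sent.sentiment)]

    MNB_clf = SklearnClassifier(MultinomialNB())
    MNB_clf.train(featuresets)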