python nlp text-classification naivebayes non-english

KeyError on a certain word


I am trying to use Naive Bayes for spam-ham classification.

training_set['E_Mail'] = training_set['E_Mail'].str.split()
vocabulary = []
for email in training_set['E_Mail']:
    for word in email:
        vocabulary.append(tuple(word))

vocabulary = list(set(vocabulary))


word_counts_per_email = {unique_word: [0] * len(training_set['E_Mail']) for unique_word in vocabulary}

for index, email in enumerate(training_set['E_Mail']):
    for word in email:
        word_counts_per_email[word][index] += 1

I keep getting a KeyError at this point:

word_counts_per_email = {unique_word: [0] * len(training_set['E_Mail']) for unique_word in vocabulary}

for index, email in enumerate(training_set['E_Mail']):
    for word in email:
        word_counts_per_email[word][index] += 1

The error message is just this:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-30-1706354aaff0> in <module>()
     3 for index, email in enumerate(training_set['E_Mail']):
     4   for word in email:
----> 5     word_counts_per_email[word][index] += 1

KeyError: 'hafta'

'hafta' is the first word in the pandas DataFrame, i.e. in the training dataset.

I tried the solution from this issue, which seemed similar to mine, but it did not work.

I would appreciate any hint on how to get past this. Thank you.


Solution

  • My guess is that the line vocabulary.append(tuple(word)) should be changed to vocabulary.append(word). tuple(word) turns a string into a tuple of its characters, so vocabulary, and therefore word_counts_per_email, ends up keyed by tuples such as ('h', 'a', 'f', 't', 'a') rather than by the string 'hafta'. Looking up the string 'hafta' then raises the KeyError.

    If this doesn't work, I suggest inspecting the contents of vocabulary and word_counts_per_email to see what went wrong.
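To make the suggestion concrete, here is a minimal sketch of the corrected counting loop. It is an illustration only: the sample emails are made up, and a plain list of token lists (the hypothetical name emails) stands in for the training_set['E_Mail'] column after .str.split(). The same loops should work unchanged on the pandas column, since iterating a Series yields its values.

```python
# Hypothetical sample data; stands in for training_set['E_Mail'] after .str.split()
raw_emails = ["hafta sonu indirim", "hafta toplanti"]
emails = [text.split() for text in raw_emails]

# Why the original failed: tuple() on a string yields its characters,
# so the dictionary keys were tuples, not the words themselves.
broken_key = tuple("hafta")   # ('h', 'a', 'f', 't', 'a')

# Fix: collect the words themselves (a set removes duplicates).
vocabulary = sorted({word for email in emails for word in email})

# One count list per unique word, one slot per email.
word_counts_per_email = {word: [0] * len(emails) for word in vocabulary}

for index, email in enumerate(emails):
    for word in email:
        word_counts_per_email[word][index] += 1

# Quick inspection, as suggested above: keys should be strings, not tuples.
print(list(word_counts_per_email)[:5])
```

If the printed keys are still tuples of single characters, the vocabulary was built before the fix was applied and needs to be rebuilt.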