python nlp text-classification naivebayes non-english

KeyError on a certain word

I am trying to use Naive Bayes for spam-ham classification.

training_set['E_Mail'] = training_set['E_Mail'].str.split()
vocabulary = []
for email in training_set['E_Mail']:
    for word in email:
        vocabulary.append(tuple(word))

vocabulary = list(set(vocabulary))

word_counts_per_email = {unique_word: [0] * len(training_set['E_Mail']) for unique_word in vocabulary}

for index, email in enumerate(training_set['E_Mail']):
    for word in email:
        word_counts_per_email[word][index] += 1

I repeatedly get a KeyError in this part:

word_counts_per_email = {unique_word: [0] * len(training_set['E_Mail']) for unique_word in vocabulary}

for index, email in enumerate(training_set['E_Mail']):
 for word in email:
   word_counts_per_email[word][index] += 1

The error message is just this:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-30-1706354aaff0> in <module>()
     3 for index, email in enumerate(training_set['E_Mail']):
     4   for word in email:
----> 5     word_counts_per_email[word][index] += 1

KeyError: 'hafta'

'hafta' is the first word in the pandas DataFrame and in the training dataset.

I tried the solution on this issue that seemed similar to mine but it didn't work out.

I would appreciate any hint that helps me get past this. Thank you.

Solution

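My guess is that the line vocabulary.append(tuple(word)) should be changed to vocabulary.append(word). Calling tuple() on a string splits it into its characters, so your version stores keys such as ('h', 'a', 'f', 't', 'a') in vocabulary, and therefore in word_counts_per_email, instead of the string 'hafta'. The later lookup word_counts_per_email[word] uses the plain string, which is not among those keys, hence the KeyError.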
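The difference can be reproduced in a minimal sketch. The emails list below is a made-up stand-in for training_set['E_Mail'] after .str.split() (the real data is not shown in the question), and plain lists are used instead of a DataFrame so the snippet runs on its own:

```python
# Hypothetical stand-in for training_set['E_Mail'] after .str.split()
emails = [["hafta", "sonu", "indirim"], ["hafta", "toplanti"]]

# tuple() on a string yields a tuple of its characters, not the word itself.
broken_key = tuple("hafta")
print(broken_key)  # ('h', 'a', 'f', 't', 'a')

# Suggested fix: append the word (a str) directly.
vocabulary = []
for email in emails:
    for word in email:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))

word_counts_per_email = {unique_word: [0] * len(emails) for unique_word in vocabulary}

for index, email in enumerate(emails):
    for word in email:
        # The key is now the same str that was stored, so the lookup succeeds.
        word_counts_per_email[word][index] += 1

print(word_counts_per_email["hafta"])  # [1, 1]
```

With string keys, the counting loop from the question should run unchanged.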

If this doesn't work, I suggest inspecting the contents of vocabulary and word_counts_per_email to determine what went wrong.
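One way to do that inspection is sketched below. The emails list is again made-up sample data, and the vocabulary is deliberately built with tuple(word) to reproduce the situation from the question. The names key_types and missing are illustrative only:

```python
# Hypothetical sample data standing in for the tokenized e-mails
emails = [["hafta", "sonu"], ["hafta"]]

# Vocabulary built the same way as in the question (with tuple(word))
vocabulary = list({tuple(word) for email in emails for word in email})

# 1. Look at a few entries to see what the keys actually are.
print(vocabulary[:5])

# 2. Check which types ended up in the vocabulary.
key_types = {type(key).__name__ for key in vocabulary}
print(key_types)  # {'tuple'}

# 3. List the words from the e-mails that have no matching key.
vocab_set = set(vocabulary)
missing = {word for email in emails for word in email if word not in vocab_set}
print(sorted(missing))  # ['hafta', 'sonu']
```

If every word shows up in missing and the key type is not str, the keys were transformed somewhere between tokenizing and building the dictionary.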