I have a CSV file with two columns, 'Complaint Details' and 'DispositionCode'. I want to classify the complaint details into 8 different classes of disposition code, such as 'Door locked from inside', 'Vendor error', 'Missing key or lock', ... The dataset is shown in the image.
What would be a good method to classify these and measure the accuracy?
Initially I am trying to remove stopwords from the complaint details and then use a Naive Bayes classifier.
The code is as follows:
import csv
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

your_list = []
with open('H:/Project/rash.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)
print(your_list)

stop_words = set(stopwords.words("english"))
words = word_tokenize(your_list)
filteredSent = []
for w in words:
    if w not in stop_words:
        filteredSent.append(w)
print(filteredSent)
But I am getting the following error:
for match in self._lang_vars.period_context_re().finditer(text): TypeError: expected string or bytes-like object
Your code never gets to the stopwords, since the error is due to misusing word_tokenize(): it needs to be called on a single string, not on your whole dataset. You can tokenize your data like this:
for row in your_list:
    row[0] = word_tokenize(row[0])
You'll now need to rethink the rest of your code. You have a whole list of sentences, not just one. Use a loop like the above so you're examining the words of one sentence at a time.
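To make that concrete, here is a minimal sketch of how the rest could look, including the accuracy question. It is only an outline under some assumptions, not the original poster's code: it assumes the CSV has a header row and exactly the two columns described, and it swaps in scikit-learn's CountVectorizer, MultinomialNB and accuracy_score as one common way to train a Naive Bayes classifier and report accuracy on held-out data.

import csv

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

stop_words = set(stopwords.words("english"))

texts, labels = [], []
with open('H:/Project/rash.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)                      # skip the header row (assumption)
    for row in reader:
        # Tokenize one complaint at a time and drop stopwords (case-insensitively)
        words = word_tokenize(row[0])
        filtered = [w for w in words if w.lower() not in stop_words]
        texts.append(" ".join(filtered))
        labels.append(row[1])         # the DispositionCode column

# Turn the filtered complaints into bag-of-words feature vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Hold out part of the data so accuracy is measured on unseen complaints
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

clf = MultinomialNB()
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))

Note that CountVectorizer can also drop English stopwords itself (stop_words='english'), which would make the NLTK filtering step optional.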