I have a nearly balanced dataset with 9 unique categories, each with roughly 2,200 rows (±100). To build a model, I followed the approaches in the URLs below, but in every case the accuracy comes out around 58% and precision/recall around 54%. Can you tell me what I am doing wrong?
https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f https://towardsdatascience.com/machine-learning-multiclass-classification-with-imbalanced-data-set-29f6a177c1a
https://medium.com/@robert.salgado/multiclass-text-classification-from-start-to-finish-f616a8642538
My dataset has only two columns: one is the feature and the other is the label.
import pandas as pd
df = pd.read_excel('Prediction.xlsx', sheet_name='Sheet1')
df.head()
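import re
import sys
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # assumed: pattern from the first tutorial linked above
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))
!{sys.executable} -m pip install lxml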
def clean_text(text):
    """
    text: a string
    return: modified initial string
    """
    text = BeautifulSoup(text, "html.parser").text  # HTML decoding
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text)  # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # delete stopwords from text
    return text
df['notes_issuedesc'] = df['notes_issuedesc'].apply(clean_text)
print_plot(10)
df['notes_issuedesc'].apply(lambda x: len(x.split(' '))).sum()
from sklearn.model_selection import train_test_split
X = df.notes_issuedesc
y = df.final
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
%%time
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
nb = Pipeline([('vect', CountVectorizer()),
               ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB()),
              ])
nb.fit(X_train, y_train)
from sklearn.metrics import accuracy_score, classification_report
y_pred = nb.predict(X_test)
print('accuracy %s' % accuracy_score(y_test, y_pred))
# my_tags must hold the 9 label names in sorted label order
print(classification_report(y_test, y_pred, target_names=my_tags))
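Since `clean_text` and the vectorizer both assume every row holds a string, it may be worth checking the two columns for missing or empty entries before fitting. The following is only a minimal sketch: the small DataFrame is a hypothetical stand-in for the spreadsheet, and only the column names `notes_issuedesc` and `final` come from the code above.

```python
import pandas as pd

# Hypothetical stand-in for Prediction.xlsx; only the column names are taken from the question.
df = pd.DataFrame({
    "notes_issuedesc": ["printer offline", None, "   ", "password reset", "vpn drops"],
    "final": ["hardware", "hardware", "access", "access", None],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()

# Treat whitespace-only text as missing as well.
text = df["notes_issuedesc"].fillna("").astype(str).str.strip()
usable = (text != "") & df["final"].notna()

# Keep only rows that have both usable text and a label.
clean_df = df[usable].reset_index(drop=True)

print(missing_per_column)
print(clean_df["final"].value_counts())
```

Rows dropped here would otherwise reach the model as empty or invalid text, which could plausibly pull accuracy down across all classes.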
I was able to get my code to work by first correcting my data.
The issue was that a lot of data was missing, so I filled the missing values with mean values. I also used a scatter chart to identify outliers and then removed them.
After these few data wrangling operations, the model produced higher accuracy.
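The two cleaning steps described above might look roughly like this in pandas. This is a sketch under stated assumptions, not the exact code that was run: the `value` column and its numbers are hypothetical, and the 1.5 × IQR rule is an assumed numeric stand-in for picking outliers off a scatter chart by eye.

```python
import pandas as pd

# Hypothetical numeric column; the real columns are not shown in the answer.
df = pd.DataFrame({"value": [10.0, 12.0, None, 11.0, 13.0, 200.0, 9.0]})

# Fill missing entries with the column mean (computed over the non-missing values).
df["value"] = df["value"].fillna(df["value"].mean())

# Flag outliers with the 1.5 * IQR rule (assumed; stands in for visual inspection).
q1 = df["value"].quantile(0.25)
q3 = df["value"].quantile(0.75)
iqr = q3 - q1
in_range = df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Keep only the rows inside the accepted range.
cleaned = df[in_range].reset_index(drop=True)
print(cleaned)
```

Note that the extreme value inflates the mean used for filling, so removing outliers before imputing may be the better order in some datasets.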