Tags: python, machine-learning, scikit-learn, text-classification

Sklearn text classification: Why is accuracy so low?


Alright, I'm following https://medium.com/@phylypo/text-classification-with-scikit-learn-on-khmer-documents-1a395317d195 and https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html, trying to classify text by category. My dataframe, named result, is laid out like this:

target   type   post
1        intj   "hello world shdjd"
2        entp   "hello world fddf"
16       estj   "hello world dsd"
4        esfp   "hello world sfs"
1        intj   "hello world ddfd"
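
For reference, a minimal stand-in frame built from the placeholder rows above, so the code below can be run end to end:

import pandas as pd

result = pd.DataFrame({
    'target': [1, 2, 16, 4, 1],
    'type':   ['intj', 'entp', 'estj', 'esfp', 'intj'],
    'post':   ['hello world shdjd', 'hello world fddf', 'hello world dsd',
               'hello world sfs', 'hello world ddfd'],
})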

The goal is to categorize a post by its type; target just assigns a number from 1 to 16 to each of the 16 types. To classify the text I do this:

from sklearn import model_selection, preprocessing, naive_bayes, linear_model, metrics
from sklearn.feature_extraction.text import TfidfVectorizer

result = result[:1000]  # shorten df - was :600

# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(result['post'], result['type'], test_size=0.30, random_state=1)

# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)

def tokenizersplit(text):
    return text.split()

tfidf_vect = TfidfVectorizer(tokenizer=tokenizersplit, encoding='utf-8', min_df=2, ngram_range=(1, 2), max_features=25000)

# learn the vocabulary and idf weights over the full corpus
tfidf_vect.fit(result['post'])

xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)

def train_model(classifier, trains, t_labels, valids, v_labels):
    # fit the training dataset on the classifier
    classifier.fit(trains, t_labels)

    # predict the labels on validation dataset
    predictions = classifier.predict(valids)

    return metrics.accuracy_score(v_labels, predictions)

# Naive Bayes
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
print("NB accuracy:", accuracy)

# Logistic Regression
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
print("LR accuracy:", accuracy)

Depending on how much I shorten result at the beginning, accuracy peaks at around 0.4 for all algorithms. It is supposed to be 0.8-0.9.
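
For context, one quick way to see how far 0.4 sits above a trivial baseline is sklearn's DummyClassifier, which ignores the features and always predicts the most frequent class (a sketch, reusing the matrices from above):

from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(xtrain_tfidf, train_y)
print("Baseline accuracy:", baseline.score(xvalid_tfidf, valid_y))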

I read "scikit very low accuracy on classifiers(Naive Bayes, DecissionTreeClassifier)" but don't see how to apply it to my dataframe. My data is simple: it has a category (type) and text (post).

What is wrong here?

EDIT - Naive Bayes, take 2:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
import numpy as np

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(result.post, result.target)

# predicts on the same data the pipeline was trained on,
# so this measures training accuracy, not validation accuracy
docs_test = result.post
predicted = text_clf.predict(docs_test)

print("Naive Bayes:")
print(np.mean(predicted == result.target))

Solution

  • What you are doing

    I believe the mistake is in these lines:

    encoder = preprocessing.LabelEncoder()
    train_y = encoder.fit_transform(train_y)
    valid_y = encoder.fit_transform(valid_y)
    

    By fitting twice you reset the mapping the LabelEncoder has learned: each fit_transform call re-derives the classes from scratch, using only the labels it is given. A simpler example:

    from sklearn import preprocessing
    
    le = preprocessing.LabelEncoder()
    y_train = le.fit_transform(["class1", "class2", "class3"])
    y_valid = le.fit_transform(["class2", "class3"])
    print(y_train)
    print(y_valid)
    

    Outputs these label encodings:

    [0 1 2]
    [0 1]
    

    This is wrong: the encoded label 0 means class1 in the training set but class2 in the validation set, so the classifier is trained and evaluated against inconsistent targets.
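    Fitting the encoder once and reusing it keeps the codes consistent. A minimal sketch of the correct pattern, with the same toy labels as above:

    le = preprocessing.LabelEncoder()
    y_train = le.fit_transform(["class1", "class2", "class3"])  # learns the mapping
    y_valid = le.transform(["class2", "class3"])                # reuses it, giving [1 2]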

    Fix

    I would change your first lines to:

    result = result[:1000]  # shorten df - was :600
    
    # Encode the labels before splitting
    encoder = preprocessing.LabelEncoder()
    y_encoded = encoder.fit_transform(result['type'])
    
    # CARE that I changed the target from result['type'] to y_encoded
    train_x, valid_x, train_y, valid_y = model_selection.train_test_split(result['post'], y_encoded, test_size=0.30, random_state=1)
    
    def tokenizersplit(text):
        return text.split()
    
    .
    .
    .
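
    For completeness, a minimal sketch of how the rest could continue, reusing the question's vectorizer and train_model. Fitting the vectorizer on the training split only (rather than on all of result['post']) is a common extra precaution against leaking validation text into the features:

    tfidf_vect = TfidfVectorizer(tokenizer=tokenizersplit, encoding='utf-8',
                                 min_df=2, ngram_range=(1, 2), max_features=25000)
    tfidf_vect.fit(train_x)  # fit on training text only
    xtrain_tfidf = tfidf_vect.transform(train_x)
    xvalid_tfidf = tfidf_vect.transform(valid_x)

    accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
    print("NB accuracy:", accuracy)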