Tags: python, python-3.x, machine-learning, scikit-learn, knn

KNN for Text Classification using TF-IDF scores


I have a CSV file (corpus.csv) with graded abstracts (text) in the following format:

Institute,    Score,    Abstract
UoM,    3.0,    Hello, this is abstract one
UoM,    3.2,    Hello, this is abstract two and yet counting.
UoE,    3.1,    Hello, yet another abstract but this is a unique one.
UoE,    2.2,    Hello, please no more abstract.

I am trying to create a KNN classification program in Python that takes a user-input abstract such as "This is a new unique abstract", finds the closest match in the corpus (CSV), and returns the score/grade of that predicted abstract. How can I achieve that?

I have the following code:

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import numpy as np
import pandas as pd
from csv import reader,writer
import operator as op
import string

#Read data from corpus
r = reader(open('corpus.csv','r'))
abstract_list = []
score_list = []
institute_list = []
row_count = 0
for row in list(r)[1:]:
    institute,score,abstract = row
    if len(abstract.split()) > 0:
      institute_list.append(institute)
      score = float(score)
      score_list.append(score)
      # strip punctuation and lowercase (str.translate needs a translation table in Python 3)
      abstract = abstract.translate(str.maketrans('', '', string.punctuation)).lower()
      abstract_list.append(abstract)
      row_count = row_count + 1

print("Total processed data: ", row_count)

#Vectorize with TF-IDF (word n-grams 1-4, English stop words removed) using sklearn
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,4),
                     min_df = 0, stop_words = 'english', sublinear_tf=True)
response = vectorizer.fit_transform(abstract_list)
feature_names = vectorizer.get_feature_names()  # get_feature_names_out() on newer scikit-learn

In the code above, how can I use the features from the TF-IDF computation for KNN classification as described? (Probably using the sklearn.neighbors.KNeighborsClassifier framework.)

P.S. The classes in this application are the respective scores/grades of the abstracts.

I have a background in visual deep learning, but I lack knowledge of text classification, especially with KNN. Any help would be much appreciated. Thank you in advance.


Solution

  • KNN is a classification algorithm, meaning you have to have a class attribute: KNN can use the output of TF-IDF as the input matrix (TrainX), but you still need TrainY, the class for each row in your data. However, since your scores are continuous, you can use a KNN regressor instead and treat the scores as the target variable (a classifier alternative is sketched after the code below):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from nltk.corpus import stopwords
    import numpy as np
    import pandas as pd
    from csv import reader,writer
    import operator as op
    import string
    from sklearn import neighbors
    
    #Read data from corpus
    r = reader(open('corpus.csv','r'))
    abstract_list = []
    score_list = []
    institute_list = []
    row_count = 0
    for row in list(r)[1:]:
        institute,score,abstract = row[0], row[1], row[2]
        if len(abstract.split()) > 0:
          institute_list.append(institute)
          score = float(score)
          score_list.append(score)
          # strip punctuation and lowercase (str.translate needs a translation table in Python 3)
          abstract = abstract.translate(str.maketrans('', '', string.punctuation)).lower()
          abstract_list.append(abstract)
          row_count = row_count + 1
    
    print("Total processed data: ", row_count)
    
    #Vectorize with TF-IDF (word n-grams 1-4, English stop words removed) using sklearn
    vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,4),
                         min_df = 0, stop_words = 'english', sublinear_tf=True)
    response = vectorizer.fit_transform(abstract_list)
    classes = score_list
    feature_names = vectorizer.get_feature_names()  # get_feature_names_out() on newer scikit-learn
    
    clf = neighbors.KNeighborsRegressor(n_neighbors=1)
    clf.fit(response, classes)   # fit on the TF-IDF matrix with the scores as targets
    clf.predict(response)        # predicts a score for every training abstract
    

    The call to predict returns the predicted score for each instance (here, the training abstracts themselves).
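
    To score a new, user-supplied abstract, as asked in the question, transform it with the already fitted vectorizer and pass the result to predict; kneighbors also tells you which corpus abstract is closest. A minimal sketch continuing from the variables above (the example string is just a placeholder):

    # Vectorize the unseen abstract with the fitted vectorizer (transform, not fit_transform)
    new_abstract = ["this is a new unique abstract"]
    new_vec = vectorizer.transform(new_abstract)

    # With n_neighbors=1 the predicted score is simply the score of the closest abstract
    predicted_score = clf.predict(new_vec)
    print("Predicted score:", predicted_score[0])

    # Inspect which corpus abstract was the nearest neighbour and its score
    distances, indices = clf.kneighbors(new_vec, n_neighbors=1)
    nearest = indices[0][0]
    print("Closest abstract:", abstract_list[nearest], "scored", score_list[nearest])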
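
    If you would rather keep this as a classification problem, as the P.S. suggests (each distinct score is a class), the same TF-IDF matrix can be fed to KNeighborsClassifier; the only difference is that the labels are treated as discrete values. A sketch under that assumption, reusing response and score_list from above:

    from sklearn.neighbors import KNeighborsClassifier

    # Treat each distinct score (3.0, 3.2, ...) as a discrete class label
    labels = [str(s) for s in score_list]

    knn_clf = KNeighborsClassifier(n_neighbors=1)
    knn_clf.fit(response, labels)

    # Returns the score label of the single nearest training abstract
    print(knn_clf.predict(vectorizer.transform(["this is a new unique abstract"])))

    With n_neighbors=1 both approaches return the score of the single closest abstract; with a larger k the regressor averages the neighbours' scores while the classifier takes a majority vote among them.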