I'm developing a tweet classifier. I trained a k-NN classifier on a TF-IDF dataset in which each row has a length of 3,173; after training the model I saved it to a file so that I can classify new tweets.
The problem is that every time I extract new tweets and try to classify them, the TF-IDF length varies depending on the vocabulary of the newly extracted tweets, so the model cannot classify those new tweets.
I've been searching and trying to solve this for two days but haven't found an efficient solution. How can I adapt the dimension of the query data to the dimension of the training data efficiently?
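To illustrate what I mean (the toy corpora below are just for illustration): each call to fit_transform() learns a brand-new vocabulary, so the number of columns follows whatever corpus it last saw:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()
X = vect.fit_transform(["good day", "bad day today"])
print(X.shape)   # (2, 4): vocabulary of 4 terms

X2 = vect.fit_transform(["something new"])
print(X2.shape)  # (1, 2): re-fitting learned a different, smaller vocabulary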
Here is my code:
# CLASSIFY TASS TEST TWEETS
import joblib
import pandas as pd
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

clf = joblib.load('files/model_knn_pos.sav')
# Load the tweets
dfNew = pd.read_csv('files/tweetsTASStestCaract.csv', encoding='UTF-8', sep='|')
# Preprocess (Preprocesado is my own helper class)
prepro = Preprocesado()
dfNew['clean_text'] = prepro.procesa(dfNew['tweet'])
# Collapse to two classes: anything that is not positive becomes 'NoPos'
dfNew['type'].replace(['NEU', 'N', 'NONE'], 'NoPos', inplace=True)
# Helper used as the vectorizer's tokenizer
def tokenize(s):
    return s.split()
# Build one vector per tweet, keeping only terms that appear at least 3 times
vect = TfidfVectorizer(tokenizer=tokenize, ngram_range=(1, 2), max_df=0.75, min_df=3, sublinear_tf=True)
muestra = vect.fit_transform(dfNew['clean_text']).toarray().tolist()
# Append the extra (non-text) features to each tweet's vector
caract = dfNew.drop(columns=['tweet', 'clean_text', 'type']).values
for i in range(len(muestra)):
    muestra[i].extend(caract[i])
# Classify pos
y_train = dfNew['type'].values
resultsPos = clf.predict(muestra)
print(Counter(resultsPos))
And this is the error I get:
File "sklearn/neighbors/binary_tree.pxi", line 1294, in sklearn.neighbors.kd_tree.BinaryTree.query
ValueError: query data dimension must match training data dimension
The solution is simple: call vect.fit_transform() only on the training data; on the test data, call vect.transform() with the same fitted vectorizer. transform() reuses the vocabulary and IDF weights learned at fit time, so new tweets are projected onto exactly the columns the model was trained on and the dimensions always match. Since you classify new tweets in a separate script, you also need to persist the fitted vectorizer alongside the model (joblib works for this too) instead of building a fresh TfidfVectorizer each time.
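Here is a minimal sketch of the pattern, assuming dfTrain holds your training tweets; the path files/vect_pos.sav is a name made up for illustration:
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(s):
    return s.split()

# --- in the training script ---
vect = TfidfVectorizer(tokenizer=tokenize, ngram_range=(1, 2), max_df=0.75, min_df=3, sublinear_tf=True)
X_train = vect.fit_transform(dfTrain['clean_text'])  # the vocabulary is learned once, here
joblib.dump(vect, 'files/vect_pos.sav')              # persist it next to the k-NN model

# --- in the classification script ---
vect = joblib.load('files/vect_pos.sav')             # same vocabulary as at training time
muestra = vect.transform(dfNew['clean_text']).toarray().tolist()  # no re-fitting
# append the extra features exactly as during training, then call clf.predict(muestra)
Note that tokenize must be importable wherever you load the vectorizer, because joblib pickles only a reference to the function. If you were working with text features only, an even tidier option would be to put the TfidfVectorizer and the k-NN classifier into a single sklearn Pipeline and dump that one object, so the fitted vocabulary can never drift away from the model; with your extra hand-crafted features appended, saving the vectorizer separately as above is the simpler route.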