python scikit-learn nlp knn tweets

Query data dimension must match training data dimension


I'm developing a tweet classifier. I trained a kNN classifier on a TF-IDF dataset in which each row has a length of 3,173. After training the model, I saved it to a file so that I can classify new tweets.

The problem is that every time I extract new tweets and try to classify them, the TF-IDF vector length varies depending on the vocabulary of the newly extracted tweets, so the model cannot classify them.

I've been searching and trying to solve this for two days but have not found an efficient solution. How can I efficiently adapt the dimension of the query data to the dimension of the training data?

Here is my code:

    # CLASSIFY TASS TEST TWEETS
    clf = joblib.load('files/model_knn_pos.sav')

    # Load the tweets
    dfNew = pd.read_csv(f'files/tweetsTASStestCaract.csv', encoding='UTF-8',sep='|')

    # Preprocess
    prepro = Preprocesado()
    dfNew['clean_text'] = prepro.procesa(dfNew['tweet'])

    # Excluded middle: collapse every non-positive label into 'NoPos'
    dfNew['type'].replace(['NEU','N','NONE'], 'NoPos', inplace=True)

    # Helper function to build the vectors
    def tokenize(s):
        return s.split()

    # Create one vector per tweet, keeping only the words that appear at least 3 times
    vect = TfidfVectorizer(tokenizer=tokenize, ngram_range=(1, 2), max_df=0.75, min_df=3, sublinear_tf=True)
    muestra = vect.fit_transform(dfNew['clean_text']).toarray().tolist()

    # Append the extra features to each tweet to classify
    for i in range(len(muestra)):
        caract=dfNew.drop(columns=['tweet','clean_text','type']).values[i]
        muestra[i].extend(caract)

    # Classify positive
    y_train=dfNew['type'].values
    resultsPos = clf.predict(muestra)
    print(Counter(resultsPos))

And this is the error I get:

File "sklearn/neighbors/binary_tree.pxi", line 1294, in sklearn.neighbors.kd_tree.BinaryTree.query

ValueError: query data dimension must match training data dimension


Solution

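  • The solution is simple:

    Call vect.fit_transform() only on the training data. For the test data (or any new tweets), call only vect.transform() on that same fitted vectorizer. Fitting a fresh TfidfVectorizer on the new tweets builds a different vocabulary, so the number of columns no longer matches what the model was trained on. Since the model is loaded from a file, the vectorizer fitted during training has to be saved and loaded as well (for example with joblib) so it can be reused at prediction time.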
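    A minimal sketch of that workflow, under some assumptions: the toy tweets, the labels, the single extra feature column, the temporary directory, and the file name vectorizer.sav are all made up for illustration, and the default tokenizer is used without min_df/max_df so that the tiny corpus still produces a vocabulary. The point it tries to show is that the vectorizer is fitted once, saved next to the model, and only transform() is called on new tweets.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# --- Training time -------------------------------------------------------
# Hypothetical stand-ins for the cleaned training tweets and their labels.
train_texts = [
    "me encanta este dia",
    "que dia tan bueno",
    "odio este trafico",
    "que trafico tan malo",
]
train_labels = ["Pos", "Pos", "NoPos", "NoPos"]
# Hypothetical extra numeric feature appended to each TF-IDF row.
train_extra = np.array([[1.0], [1.0], [0.0], [0.0]])

# fit_transform() learns the vocabulary; this is the only place fit happens.
vect = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X_train = np.hstack([vect.fit_transform(train_texts).toarray(), train_extra])

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, train_labels)

# Save the fitted vectorizer together with the model (paths are assumptions).
model_dir = tempfile.mkdtemp()
joblib.dump(vect, os.path.join(model_dir, "vectorizer.sav"))
joblib.dump(clf, os.path.join(model_dir, "model_knn_pos.sav"))

# --- Prediction time -----------------------------------------------------
vect_loaded = joblib.load(os.path.join(model_dir, "vectorizer.sav"))
clf_loaded = joblib.load(os.path.join(model_dir, "model_knn_pos.sav"))

new_texts = ["este dia es genial", "trafico horrible otra vez"]
new_extra = np.array([[1.0], [0.0]])

# transform() reuses the training vocabulary: unseen words are ignored,
# so the number of columns is the same as at training time.
X_new = np.hstack([vect_loaded.transform(new_texts).toarray(), new_extra])

predictions = clf_loaded.predict(X_new)
print(X_train.shape[1], X_new.shape[1])
print(predictions)
```

    In the question's script this would amount to loading the saved vectorizer and replacing vect.fit_transform(dfNew['clean_text']) with a transform() call on it. One caveat: a vectorizer that uses a custom tokenizer function may only be loadable if that function is importable from a module at load time.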