Tags: python, machine-learning, scikit-learn, nlp, xgboost

Do I need to retrain an NLP model every time because of an incompatible shape after transforming?


I'm trying to build an NLP model that uses XGBoost. Here is my code:

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

loaded_model = joblib.load('fraud.sav')

def clean_data(user_input):
    '''
    Clean the data: remove digits, punctuation, etc.
    '''
    # ... cleaning steps ...
    return user_input

data_processed = clean_data(data_raw)    # original training data
input_cleaned = clean_data(user_data)    # new user input

total_data = pd.concat([data_processed, input_cleaned])

vectorizer = TfidfVectorizer(strip_accents='unicode',
                             analyzer='word',
                             ngram_range=(1, 2),
                             max_features=15000,
                             smooth_idf=True,
                             sublinear_tf=True)
vectorizer.fit(total_data['text'])

X_training_vectorized = vectorizer.transform(total_data['text'])
X_test = vectorizer.transform(input_cleaned['text'])

pca = PCA(n_components=0.95)
pca.fit(X_training_vectorized.toarray())
X_test_pca = pca.transform(X_test.toarray())

y_test = loaded_model.predict(X_test_pca)

What I don't understand is this: I previously trained the model on a dataset of 10,000+ documents and got good results. I then decided to save the model so that I could make predictions on user data. The model detects whether a text document is fraudulent or real, and I have a dataset labelled for fraudulent data.

I understand that when transforming data, the vectorizer and the PCA should both be fitted on the whole dataset so that the transform produces the same shape.

What I don't understand is how to transform user input so that it has the same shape the pretrained model expects. What is the proper procedure for this? I would love answers that also consider the performance/time needed to process the data.


Solution

  • This is handled automatically by the CountVectorizer (TfidfVectorizer behaves the same way): tokens which appear in a new dataset but not in the data the vectorizer was fit on are simply ignored, and the shape of the output remains the same.

    For example:

    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd
    
    data = ['the cat in the hat', 'cats like milk', 'cats and rats']
    
    cv = CountVectorizer()
    dtm = cv.fit_transform(data)
    pd.DataFrame(dtm.todense(), columns=cv.get_feature_names_out())
    
       and  cat  cats  hat  in  like  milk  rats  the
    0    0    1     0    1   1     0     0     0    2
    1    0    0     1    0   0     1     1     0    0
    2    1    0     1    0   0     0     0     1    0

    Now, if we pass new data that includes tokens the fitted CountVectorizer hasn't seen before (birds, dogs), they are ignored and the dimensionality of the document-term matrix remains the same:

    data2 = ['dogs and cats', 'birds and cats', 'dogs and birds']
    dtm = cv.transform(data2)
    pd.DataFrame(dtm.todense(), columns=cv.get_feature_names_out())
    
       and  cat  cats  hat  in  like  milk  rats  the
    0    1    0     1    0   0     0     0     0    0
    1    1    0     1    0   0     0     0     0    0
    2    1    0     1    0   0     0     0     0    0

    Since unseen tokens are ignored, this stresses the importance of retraining and/or of keeping the distribution of tokens consistent between your training data and any data the model is used on (a quick way to check this is sketched below).
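    As a small, hypothetical add-on (not part of the original answer), you can gauge how far new data has drifted from the fitted vocabulary by measuring the share of out-of-vocabulary tokens, reusing the cv object fitted above:

    analyzer = cv.build_analyzer()           # same tokenization the vectorizer uses
    vocab = set(cv.vocabulary_)              # tokens learned during fit
    tokens = [tok for doc in data2 for tok in analyzer(doc)]
    oov_rate = sum(tok not in vocab for tok in tokens) / len(tokens)
    print(f"{oov_rate:.0%} of tokens are out of vocabulary")   # ~44% for data2

    A high rate would be a signal that the vectorizer (and the downstream model) should be retrained.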

    Additionally, I would avoid using a floating-point value for n_components in PCA and instead pick a fixed number of components (pass an integer rather than a float) so that the output dimensionality of the preprocessing stays consistent between training and prediction. A sketch of the overall workflow follows.
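    To tie this back to the original question, here is a minimal sketch of the usual workflow (names such as train_data, user_data, 'vectorizer.sav', 'pca.sav' and n_components=300 are placeholders, not taken from the question): fit the vectorizer and PCA once on the training data, persist them alongside the model, and at prediction time only call transform on the cleaned user input, so the shapes always match what the model expects and nothing has to be refit per request:

    import joblib
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import PCA

    # --- training time (run once) ---
    vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word',
                                 ngram_range=(1, 2), max_features=15000,
                                 smooth_idf=True, sublinear_tf=True)
    X_train = vectorizer.fit_transform(train_data['text'])

    pca = PCA(n_components=300)                 # fixed integer, not 0.95
    X_train_pca = pca.fit_transform(X_train.toarray())
    # model = XGBClassifier().fit(X_train_pca, y_train)

    joblib.dump(vectorizer, 'vectorizer.sav')
    joblib.dump(pca, 'pca.sav')
    # joblib.dump(model, 'fraud.sav')

    # --- prediction time (no refitting, no concatenation with training data) ---
    vectorizer = joblib.load('vectorizer.sav')
    pca = joblib.load('pca.sav')
    model = joblib.load('fraud.sav')

    user_cleaned = clean_data(user_data)                  # same cleaning as in training
    X_user = vectorizer.transform(user_cleaned['text'])   # same 15000 columns
    X_user_pca = pca.transform(X_user.toarray())          # same 300 columns
    y_pred = model.predict(X_user_pca)

    Because the expensive fit steps happen only once at training time, prediction on user input is fast and its shape is guaranteed to match the saved model.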