I trained a model with a Random Forest classifier and saved it using pickle. Then, in a different Python file, I preprocessed an input sentence (I vectorized it as a Bag of Words and then applied TF-IDF). After that I used train_test_split with the parameter test_size=1 to make this sentence look like test data. When I give this test data to my trained model, it raises:
ValueError: X has 14 features, but RandomForestClassifier is expecting 148409 features as input
This is probably because I trained the model on a whole dataset and now there is only one sample. But how am I supposed to use my model if a one-sample array (or matrix) doesn't have the same shape as an array with thousands of samples from the dataset? Shapes during training:
train dataset features size: (23588, 148409)
train dataset label size: (23588,)
test dataset features size: (10110, 148409)
test dataset label size: (10110,)
Shape of one sentence when I try to use my model (as an example):
text_test shape (15, 14)
Code in the training (building) Python file:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
vectorizer = CountVectorizer()
BoW_transformer = vectorizer.fit(data['Text'])
BoW_data = BoW_transformer.transform(data['Text'])
tf_idf_transformer = TfidfTransformer().fit(BoW_data)
data_tf_idf = tf_idf_transformer.transform(BoW_data)
text_train, text_test, label_train, label_test = train_test_split(
    data_tf_idf, data['Label'], test_size=0.3
)
print(f"train dataset features size: {text_train.shape}")
print(f"train dataset label size: {label_train.shape}")
print(f"test dataset features size: {text_test.shape}")
print(f"test dataset label size: {label_test.shape}")
RF_classifier = RandomForestClassifier()
RF_classifier.fit(text_train, label_train)
predict_train = RF_classifier.predict(text_train)
predict_test = RF_classifier.predict(text_test)
Code in the 'use' Python file:
import pickle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
vectorizer = CountVectorizer()
BoW_transformer = vectorizer.fit(input_string)
BoW_data = BoW_transformer.transform(input_string)
tf_idf_transformer = TfidfTransformer().fit(BoW_data)
data_tf_idf = tf_idf_transformer.transform(BoW_data)
text_test, label_test = train_test_split(
    data_tf_idf, test_size=1
)
print("text_test shape", text_test.shape)
with open("saved_model.pickle", 'rb') as f:
    RF_classifier = pickle.load(f)
predict_test = RF_classifier.predict(text_test)
I tried putting the messages in an array when calling fit(), but I either get an error or my computer freezes (probably I don't have enough RAM to train the model with NumPy arrays). I also tried to reshape the data, but I cannot reshape an array of size 210 into an array of size 3000000.
I got an answer on Russian Stack Overflow: https://ru.stackoverflow.com/questions/1493423/Почему-возникает-ошибка-valueerror
Translated:
The problem is in the preparation of the features. You need to prepare all features identically.
If you use transformers with internal state and call fit() on the data used to train the model, then you need to save the transformers with pickle, just like your model. When you want to make a prediction, you need to load the saved transformers and call only transform(), not fit().
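The advice above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the toy texts and labels, the file name saved_pipeline.pickle, the dictionary layout, and the small n_estimators value are all assumptions made for the example.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy stand-in for data['Text'] / data['Label'] (hypothetical data).
texts = ["good movie", "bad movie", "great film", "terrible film"]
labels = [1, 0, 1, 0]

# Hypothetical location for the saved objects.
path = os.path.join(tempfile.gettempdir(), "saved_pipeline.pickle")

# --- training file: fit the transformers once, on the training corpus ---
vectorizer = CountVectorizer().fit(texts)
bow = vectorizer.transform(texts)
tfidf = TfidfTransformer().fit(bow)
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(tfidf.transform(bow), labels)

# Save the fitted transformers together with the model.
with open(path, "wb") as f:
    pickle.dump({"vectorizer": vectorizer, "tfidf": tfidf, "model": model}, f)

# --- 'use' file: load everything and call transform() only, never fit() ---
with open(path, "rb") as f:
    saved = pickle.load(f)

input_string = "a good film"
# transform() expects an iterable of documents, so wrap the string in a list.
bow_new = saved["vectorizer"].transform([input_string])
features = saved["tfidf"].transform(bow_new)
prediction = saved["model"].predict(features)

# One row, and the same number of columns the model was trained on.
print(features.shape, prediction)
```

Because the loaded vectorizer reuses the vocabulary learned during training, the single sentence gets the same number of columns as the training matrix, and no train_test_split call is needed at prediction time.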
There are also transformers without internal state. For example, you can use HashingVectorizer() instead of CountVectorizer(). HashingVectorizer() has no internal state, so it consumes less memory than CountVectorizer() and you don't need to save it; you just need to initialize it with the same arguments.
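A rough sketch of the stateless variant is shown below. The toy data, the helper name make_vectorizer, and the chosen n_features value are assumptions for illustration. The sketch also leaves out the TF-IDF step: TfidfTransformer learns IDF weights in fit(), so if it were kept it would still have to be saved and reloaded.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer

# Toy stand-in for the real dataset (hypothetical data).
texts = ["good movie", "bad movie", "great film", "terrible film"]
labels = [1, 0, 1, 0]


def make_vectorizer():
    # Hypothetical helper: both files must build the vectorizer with the
    # same arguments, because those arguments fix the output width.
    return HashingVectorizer(n_features=2**10, alternate_sign=False)


# --- training file ---
# No fit() is needed: tokens are mapped to columns by a hash function.
train_features = make_vectorizer().transform(texts)
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(train_features, labels)

# --- 'use' file ---
# A fresh vectorizer with identical arguments yields the same column layout.
new_features = make_vectorizer().transform(["a good film"])
prediction = model.predict(new_features)

print(train_features.shape, new_features.shape, prediction)
```

The number of columns is set by n_features rather than by the vocabulary, so the training matrix and the single sentence always have the same width. The trade-off is that different words can collide in the same column.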