I have a dataset with 100,000 records and two columns: Text and Class.
When I apply bag of words (BOW) to my data I get a big list of features.
That is fine, I managed to work with them.
My problem is after building the model and deploying it:
if new text arrives containing new words, the model won't work, because it expects the same feature structure.
Example "This is a test, test is important" , Red "Adam pass a test", Green
so my final dataset is:

This  is  a  test  important  Adam  pass  class
1     2   1  2     1          0     0     Red
0     0   1  1     0          1     1     Green
Once the model is created and it receives this text:
"test and exam are similar", Yellow
in this case the feature set contains new features: "and", "exam", "are", "similar".
The model will break because these features were never included in the training data.
How can I resolve this issue?
To handle this issue, a fixed vocabulary is used to convert text into a bag of words. Tokens that are out of vocabulary (OOV) are represented with a special <UNK> token.
For example, let's define a vocabulary V
V = ['this', 'is', 'a', 'test', 'pass', 'and', 'are', '<UNK>']
Then, your sentences will be represented with the following vectors:
s1 = "This is a test, test is important" #important is OOV
v1 = [1, 2, 1, 2, 0, 0, 0, 1]
s2 = "Adam pass a test" # Adam is OOV
v2 = [0, 0, 1, 1, 1, 0, 0, 1]
When you represent your training data as a bag of words and fit a model, the test data will be represented in the same way and your model will predict using this representation. In your case,
s3 = "test and exam are similar"
v3 = [0, 0, 0, 1, 0, 1, 1, 2] # exam and similar are OOV, so <UNK> counts as 2
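If it helps, here is a minimal sketch in plain Python (the helper name bow_vector is just for illustration, not from any library): it builds these vectors from the fixed vocabulary V and counts every out-of-vocabulary token under <UNK>.

import re

V = ['this', 'is', 'a', 'test', 'pass', 'and', 'are', '<UNK>']

def bow_vector(text, vocab=V):
    # lowercase and split on non-word characters to get the tokens
    tokens = re.findall(r"\w+", text.lower())
    vec = [0] * len(vocab)
    for tok in tokens:
        # any token not in the vocabulary is counted under <UNK>
        idx = vocab.index(tok) if tok in vocab else vocab.index('<UNK>')
        vec[idx] += 1
    return vec

bow_vector("This is a test, test is important")  # [1, 2, 1, 2, 0, 0, 0, 1]
bow_vector("Adam pass a test")                   # [0, 0, 1, 1, 1, 0, 0, 1]
bow_vector("test and exam are similar")          # [0, 0, 0, 1, 0, 1, 1, 2]

At prediction time you apply exactly the same function (with the same V) to incoming text, so the feature vector always has the length and ordering the model was trained on.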