machine-learning nlp feature-extraction

How to match features in new records for NLP BOW


I have a dataset with 100,000 records.

It has two columns: Text and Class.

When I apply bag of words (BOW) to build my model, I get a large list of features.

That is fine; I managed to work with them.

My problem comes after building and deploying the model.

If a new text arrives containing new words, the model won't work, because it expects the same feature structure it was trained on.

Example:

"This is a test, test is important", Red
"Adam pass a test", Green

So my final dataset is:

This  is  a  test  important  Adam  pass  class
1     2   1  2     1          0     0     Red
0     0   1  1     0          1     1     Green

Once the model has been created, suppose it receives this text:

"test and exam are similar", Yellow

In this case the text introduces new features, which are:

and exam are similar

The model will break because these features were never included when the model was trained.

How can I resolve this issue?


Solution

  • To handle this, use a fixed vocabulary to convert text into a bag of words. Tokens that are out of vocabulary (OOV) are all mapped to a single special <UNK> token, so every text produces a vector of the same length.

    For example, let's define a vocabulary V:

    V = ['this', 'is', 'a', 'test', 'pass', 'and', 'are', '<UNK>']
    

    Then, your sentences will be represented with the following vectors:

    s1 = "This is a test, test is important"  # "important" is OOV
    v1 = [1, 2, 1, 2, 0, 0, 0, 1]

    s2 = "Adam pass a test"  # "Adam" is OOV
    v2 = [0, 0, 1, 1, 1, 0, 0, 1]
    

    Once you have encoded your training data this way and fitted a model, encode any new data with the same vocabulary. The vector length never changes, so the model can still predict. In your case, "exam" and "similar" are OOV and both count toward <UNK>:

    s3 = "test and exam are similar"
    v3 = [0, 0, 0, 1, 0, 1, 1, 2]
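
The encoding above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper names `tokenize` and `bow_vector` are made up for illustration, and the tokenizer (lowercase, split on anything that is not a letter, digit, or apostrophe) is an assumption, since the answer does not specify one.

```python
import re
from collections import Counter

UNK = "<UNK>"
# Fixed vocabulary from the example above; the last slot collects OOV tokens.
V = ["this", "is", "a", "test", "pass", "and", "are", UNK]


def tokenize(text):
    # Assumed tokenizer: lowercase, then keep runs of letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def bow_vector(text, vocab=V, unk=UNK):
    # Count each token, replacing anything outside the vocabulary with <UNK>.
    known = set(vocab)
    counts = Counter(tok if tok in known else unk for tok in tokenize(text))
    # Emit counts in vocabulary order so every vector has len(vocab) entries.
    return [counts[word] for word in vocab]


v1 = bow_vector("This is a test, test is important")
v2 = bow_vector("Adam pass a test")
v3 = bow_vector("test and exam are similar")

print(v1)  # [1, 2, 1, 2, 0, 0, 0, 1]
print(v2)  # [0, 0, 1, 1, 1, 0, 0, 1]
print(v3)  # [0, 0, 0, 1, 0, 1, 1, 2]
```

If you are using scikit-learn, a `CountVectorizer` fitted on the training text behaves similarly when you call `transform` (not `fit_transform`) on new text: the columns stay the same, although unseen words are dropped rather than counted in an <UNK> column.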