from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=my_max)
I am using the Keras preprocessing Tokenizer to process a corpus of text for a machine learning model. One of the Tokenizer's parameters is num_words, which defines the number of words kept in the dictionary. How should this parameter be picked? I could choose a huge number and guarantee that every word is included, but words that appear only once might be more useful grouped together as a single "out of vocabulary" token. What is the strategy for setting this parameter?
My particular use case is a model processing tweets, so every entry is below 140 characters and there is some overlap in the kinds of words used. The model is for a Kaggle competition about extracting the text that exemplifies a sentiment (e.g. "my boss is bullying me" returns "bullying me").
The base question here is "What kinds of words establish sentiment, and how often do they occur in tweets?"
Which, of course, has no hard and fast answer.
Here's how I would solve this:

First, I would fit the tokenizer on the whole corpus with no cap and look at the word-frequency statistics: how many distinct words there are, and what fraction of all tokens the most frequent N words cover (see the sketch below).

Then, I would begin experimenting with different values of num_words around that cutoff, and see the effects on your output.
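As a rough illustration (not a definitive recipe), here is one way to get those statistics from the Tokenizer itself. The tweets list and the candidate num_words values are placeholders; substitute your own corpus and range:

from keras.preprocessing.text import Tokenizer

# Placeholder corpus -- replace with your own list of tweet strings.
tweets = ["my boss is bullying me", "what a lovely day", "so tired of this"]

# Fit without num_words so word_counts reflects the full vocabulary.
probe = Tokenizer(oov_token="<OOV>")
probe.fit_on_texts(tweets)

counts = sorted(probe.word_counts.values(), reverse=True)
total = sum(counts)
print(f"{len(counts)} distinct words, {total} tokens in total")

# Candidate num_words values are arbitrary examples -- adjust to your corpus.
for candidate in (1000, 5000, 10000):
    covered = sum(counts[:candidate])
    print(f"num_words={candidate}: covers {covered / total:.1%} of all tokens")

Words outside the most frequent num_words are replaced by the OOV token (or simply dropped if no oov_token is set), so the coverage percentage tells you directly how much of your text collapses into that single token at each candidate value.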
Apologies for no "real" answer. I would argue that there is no single true strategy for choosing this value. Instead, the answer should come from leveraging the characteristics and statistics of your data.