
How to choose num_words parameter for keras Tokenizer?


from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=my_max)

I am using the Keras preprocessing Tokenizer to process a corpus of text for a machine learning model. One of the Tokenizer's parameters is num_words, which defines the number of words in the dictionary. How should this parameter be picked? I could choose a huge number and guarantee that every word is included, but words that appear only once might be more useful grouped together under a single "out of vocabulary" token. What is the strategy for setting this parameter?
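A quick way to see what num_words actually does (the toy corpus, num_words=4, and the <OOV> token below are just for illustration):

from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["my boss is bullying me", "my boss is great"]

# Only the num_words - 1 most frequent words keep their own index; with
# oov_token set, rarer words map to the <OOV> index instead of being dropped.
tokenizer = Tokenizer(num_words=4, oov_token="<OOV>")
tokenizer.fit_on_texts(texts)

print(tokenizer.texts_to_sequences(texts))
# With this toy corpus: [[2, 3, 1, 1, 1], [2, 3, 1, 1]]
# 'my' and 'boss' keep their own ids; everything else collapses to <OOV> (id 1).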

My particular use case is a model that processes tweets, so every entry is under 140 characters and there is some overlap in the types of words used. The model is for a Kaggle competition about extracting the text that exemplifies a sentiment (e.g., "my boss is bullying me" returns "bullying me").


Solution

  • The base question here is "What kinds of words establish sentiment, and how often do they occur in tweets?"

    Which, of course, has no hard and fast answer.

    Here's how I would solve this:

    1. Preprocess your data so you remove conjunctions, stop words, and "junk" from tweets.
    2. Get the number of unique words in your corpus. Are all of these words essential to convey sentiment?
    3. Analyze the highest-frequency words. Do they convey sentiment, or could they be removed in your preprocessing? When converting text to sequences, the Tokenizer keeps only the num_words - 1 most frequent words, so these popular words are exactly the ones that will fill your dictionary. (A sketch of steps 1-3 follows this list.)
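
    As a sketch of steps 1-3 (the stop-word set, cleaning rules, and toy corpus below are illustrative assumptions, not a fixed recipe):

    import re
    from collections import Counter

    # Illustrative stop words; extend this set for real tweets.
    STOP_WORDS = {"the", "a", "an", "and", "or", "is", "my", "me", "to"}

    def clean(tweet):
        # Strip URLs, @mentions, and '#' signs, then keep word tokens.
        tweet = re.sub(r"http\S+|@\w+|#", " ", tweet.lower())
        words = re.findall(r"[a-z']+", tweet)
        return [w for w in words if w not in STOP_WORDS]

    tweets = ["My boss is bullying me!", "Feeling great today :)"]  # placeholder corpus

    counts = Counter(w for t in tweets for w in clean(t))
    print("unique words:", len(counts))            # step 2: vocabulary size
    print("top words:", counts.most_common(20))    # step 3: inspect the head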

    Then, I would begin experimenting with different values and observing the effects on your output; one hypothetical sweep is sketched below.
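
    For example (vocab_coverage is my own helper, not part of Keras; it estimates how much of the corpus each candidate value retains):

    from tensorflow.keras.preprocessing.text import Tokenizer

    def vocab_coverage(tokenizer, num_words):
        # Fraction of all word occurrences covered by the top num_words - 1 words.
        total = sum(tokenizer.word_counts.values())
        kept = sorted(tokenizer.word_counts.values(), reverse=True)[:num_words - 1]
        return sum(kept) / total

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(tweets)  # tweets: your (cleaned) corpus

    for my_max in (1000, 5000, 10000, 20000):
        print(my_max, f"{vocab_coverage(tokenizer, my_max):.1%} of tokens covered")
        # ...then retrain and evaluate the model with Tokenizer(num_words=my_max)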

    Apologies for the lack of a "real" answer. I would argue that there is no single correct strategy for choosing this value; instead, it should come from the characteristics and statistics of your data.