Tags: python, python-3.x, tensorflow, tokenize

Getting the number of words from tf.Tokenizer after fitting


I initially tried making an RNN that can predict Shakespeare text, and I did it successfully using character-level encoding. But when I switched to word-level encoding, I ran into a multitude of issues. Specifically, I am having a hard time getting the total number of words in the text (I was told it was just dataset_size = tokenizer.document_count, but this just returns 1 for both char- and word-level encoding) so that I can set steps_per_epoch = dataset_size // batch_size when fitting my model. I tried setting dataset_size = sum(tokenizer.word_counts.values()), but when I fit the model, I get this warning right before the first epoch ends:

WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 32 batches). You may need to use the repeat() function when building your dataset.

So I assume that my code believes that I have slightly more training samples available than I actually do. Or it may be because I am programming on the new M1 chip, which doesn't have a production version of TF? So really, I'm just not sure how to get the exact number of words in this text.

Here's the code:

import tensorflow as tf
from tensorflow import keras
import numpy as np
import re 

shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)

with open(filepath) as f:
    shakespeare_text = f.read()

tokenizer = keras.preprocessing.text.Tokenizer(char_level=False) #Set to word-level encoding
tokenizer.fit_on_texts([shakespeare_text])

max_id = len(tokenizer.word_index) # number of distinct words
#dataset_size = sum(tokenizer.word_counts.values()) # Returns 204089
dataset_size = tokenizer.document_count # Returns 1 (only one document was passed to fit_on_texts)

Thanks:)


Solution

  • The number of occurrences of each word found in the input text is stored in the OrderedDict tokenizer.word_counts, which looks like this:

    OrderedDict([('first', 362), ('citizen', 100), ('before', 195), ('we', 862), ('proceed', 21), ('any', 189), ('further', 36), ('hear', 230), ...])
    

    So, to get the total number of words, sum the per-word counts:

    sum([x for _,x in tokenizer.word_counts.items()])
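
As a minimal sketch of the summation above, the snippet below uses a plain OrderedDict as a stand-in for tokenizer.word_counts, so it runs without TensorFlow. The counts are only the first few entries shown above, and batch_size is a hypothetical value that does not come from the question.

```python
from collections import OrderedDict

# Stand-in for tokenizer.word_counts (assumption: same shape, word -> count).
# These are only the first few entries shown above, not the full vocabulary.
word_counts = OrderedDict([
    ("first", 362), ("citizen", 100), ("before", 195), ("we", 862),
    ("proceed", 21), ("any", 189), ("further", 36), ("hear", 230),
])

# Total number of word occurrences: the sum of the per-word counts.
total_words = sum(word_counts.values())

# The comprehension form used above gives the same result.
assert total_words == sum([x for _, x in word_counts.items()])

# Number of distinct words (matches len(tokenizer.word_index)
# when no oov_token is set).
vocab_size = len(word_counts)

batch_size = 32  # hypothetical value, not from the question
steps_per_epoch = total_words // batch_size

print(total_words, vocab_size, steps_per_epoch)
```

If the token sequence is later split into train/validation parts or cut into windows (not shown in the question, so this is only a guess), fewer batches may be available than total_words // batch_size suggests, which would be consistent with the warning quoted in the question.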