I initially built an RNN that predicts Shakespeare text, and it worked with character-level encoding. When I switched to word-level encoding, I ran into several issues. Specifically, I am having a hard time getting the total number of tokens in the text (I was told it was just dataset_size = tokenizer.document_count, but this just returns 1) so that I can set steps_per_epoch = dataset_size // batch_size when fitting my model (now both character-level and word-level encoding return 1). I tried setting dataset_size = sum(tokenizer.word_counts.values()), but when I fit the model, I get this warning right before the first epoch ends:
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 32 batches). You may need to use the repeat() function when building your dataset.
So I assume that my code believes it has slightly more training batches available than it actually does. Or it may be because I am programming on the new M1 chip, which doesn't have a production version of TF? Really, I'm just not sure how to get the exact number of words in this text.
Here's the code:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import re
shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()
tokenizer = keras.preprocessing.text.Tokenizer(char_level=False) #Set to word-level encoding
tokenizer.fit_on_texts([shakespeare_text])
max_id = len(tokenizer.word_index) # number of distinct words (with char_level=False)
#dataset_size = sum(tokenizer.word_counts.values()) #Returns 204089
dataset_size = tokenizer.document_count # Returns 1
Thanks:)
The count of every word found in the input text is stored in the OrderedDict tokenizer.word_counts. It looks like this:
OrderedDict([('first', 362), ('citizen', 100), ('before', 195), ('we', 862), ('proceed', 21), ('any', 189), ('further', 36), ('hear', 230), ...])
So, to get the total number of words in the text, sum those counts:
sum([x for _,x in tokenizer.word_counts.items()])
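As a minimal sketch of that sum, the snippet below uses a plain OrderedDict as a hypothetical stand-in for tokenizer.word_counts, filled with only the first four entries shown above, so it runs without TensorFlow. The batch_size value is an assumption for illustration and does not come from the question.

```python
from collections import OrderedDict

# Hypothetical stand-in for tokenizer.word_counts: word -> number of occurrences.
# Only the first four entries from the output above are used here.
word_counts = OrderedDict(
    [("first", 362), ("citizen", 100), ("before", 195), ("we", 862)]
)

# Total number of word occurrences (tokens) in the text.
dataset_size = sum(word_counts.values())

# Number of distinct words, the analogue of len(tokenizer.word_index).
vocab_size = len(word_counts)

# Assumed batch size, for illustration only.
batch_size = 32
steps_per_epoch = dataset_size // batch_size

print(dataset_size, vocab_size, steps_per_epoch)
```

Summing word_counts.values() directly gives the same result as the list comprehension over items(). Note that this quotient is only an upper bound on the number of batches: if the input pipeline cuts the text into windows or drops incomplete batches, fewer batches may be available per epoch, which is consistent with the warning's suggestion to call repeat() on the dataset.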