I implemented a generator function to yield one-hot encoded vectors, but the generator is throwing errors
I went for a generator function to yield one-hot encoded vectors because they will be used as inputs to a deep learning LSTM model. I am doing this to avoid excessive load and memory failures when creating one-hot encodings for very large data sets. However, I am now getting errors with the generator function. I need help figuring out where I am going wrong.
Code before:
X = np.zeros((len(sequences), seq_length, vocab_size), dtype=np.bool)
y = np.zeros((len(sequences), vocab_size), dtype=np.bool)
for i, sentence in enumerate(sequences):
    for t, word in enumerate(sentence):
        X[i, t, vocab[word]] = 1
    y[i, vocab[next_words[i]]] = 1
Here,
sequences = sentences generated from data set
seq_length = length of each sentence (this is constant)
vocab_size = number of unique words in dictionary
When run on the large data set, my program produces:
len(sequences) = 44073315
seq_length = 30
vocab_size = 124958
So, when the above code is used directly on these inputs, it gives the error below.
Traceback (most recent call last):
  File "1.py", line 206, in <module>
    X = np.zeros((len(sequences), seq_length, vocab_size), dtype=np.bool)
MemoryError
(my_env) [rjagannath1@login ~]$
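For a sense of scale, here is a rough back-of-the-envelope estimate of what that allocation asks for. It assumes one byte per boolean element, which is how NumPy stores boolean arrays; the variable names are only illustrative, and nothing is actually allocated.

```python
# Rough size estimate for the dense one-hot array X (no allocation happens here).
# Assumption: each boolean element occupies 1 byte.
num_sequences = 44073315
seq_length = 30
vocab_size = 124958

bytes_per_element = 1
x_bytes = num_sequences * seq_length * vocab_size * bytes_per_element
x_tib = x_bytes / 2**40  # convert bytes to tebibytes

print(f"X would need about {x_bytes:.3e} bytes (~{x_tib:.0f} TiB)")
```

That is on the order of 150 TiB for X alone, so the MemoryError is expected on any ordinary machine.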
So, I tried creating a generator function (for testing) as below:
def gen(batch_size, no_of_sequences, seq_length, vocab_size):
    bs = batch_size
    ns = no_of_sequences
    X = np.zeros((batch_size, seq_length, vocab_size), dtype=np.bool)
    y = np.zeros((batch_size, vocab_size), dtype=np.bool)
    while(ns > bs):
        for i, sentence in enumerate(sequences):
            for t, word in enumerate(sentence):
                X[i, t, vocab[word]] = 1
            y[i, vocab[next_words[i]]] = 1
        print(X.shape())
        print(y.shape())
        yield(X, y)
        ns = ns - bs

for item in gen(1000, 44073315, 30, 124958):
    print(item)
But I get the error below:
  File "path_of_file", line 247, in gen
    X[i, t, vocab[word]] = 1
IndexError: index 1000 is out of bounds for axis 0 with size 1000
What mistake am I making in the generator function?
Modify as follows in your generator:
batch_i = 0
while(ns > bs):
    s = batch_i*batch_size
    e = (batch_i+1)*batch_size
    for i, sentence in enumerate(sequences[s:e]):
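Basically, you want to run over windows of size batch_size, so you take a running slice through sequences, which appears to be your entire dataset. In your version, enumerate(sequences) lets i run over the whole dataset while X only has batch_size rows, which is why index 1000 is out of bounds.
You also have to increment batch_i. Place that just after the yield, so add
batch_i += 1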
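Putting the slicing idea together, here is a minimal sketch of how the whole generator might look. It goes a little beyond the fragment above, and these parts are my assumptions rather than something taken from your code: the data is passed in as arguments instead of read from globals, X and y are recreated for every batch so that ones from an earlier batch do not carry over, next_words is indexed with the offset s + i because it covers the full dataset, a final partial batch is simply dropped, and dtype=bool is used because plain bool works across NumPy versions. The toy vocab, sequences and next_words are made up for illustration.

```python
import numpy as np

def gen(sequences, next_words, vocab, batch_size, seq_length, vocab_size):
    """Yield (X, y) one-hot batches, making a single pass over the data."""
    n_batches = len(sequences) // batch_size  # a final partial batch is dropped
    for batch_i in range(n_batches):
        s = batch_i * batch_size
        e = s + batch_size
        # Fresh arrays per batch so earlier batches do not leak into later ones.
        X = np.zeros((batch_size, seq_length, vocab_size), dtype=bool)
        y = np.zeros((batch_size, vocab_size), dtype=bool)
        for i, sentence in enumerate(sequences[s:e]):
            for t, word in enumerate(sentence):
                X[i, t, vocab[word]] = True
            # next_words spans the full dataset, hence the offset s.
            y[i, vocab[next_words[s + i]]] = True
        yield X, y

# Toy data (hypothetical, only for illustration).
vocab = {"a": 0, "b": 1, "c": 2}
sequences = [["a", "b"], ["b", "c"], ["c", "a"], ["a", "a"]]
next_words = ["c", "a", "b", "b"]

batches = list(gen(sequences, next_words, vocab,
                   batch_size=2, seq_length=2, vocab_size=3))
for X, y in batches:
    print(X.shape, y.shape)  # shape is an attribute, not a method
```

Note that X.shape is an attribute, so print(X.shape()) in your test code would raise a TypeError once the IndexError is out of the way. Some training loops expect a generator that never runs out; if yours does, the body could be wrapped in an outer while True loop.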