python-3.x, machine-learning, keras, theano

Keras fit_generator(), is this the correct usage?


So far I have come up with this hacky code; it runs and outputs the following:

Epoch 10/10
       1/3000 [..............................] - ETA: 27s - loss: 0.3075 - acc: 0.7270
       6/3000 [..............................] - ETA: 54s - loss: 0.3075 - acc: 0.7355
.....
    2996/3000 [============================>.] - ETA: 0s - loss: 0.3076 - acc: 0.7337
    2998/3000 [============================>.] - ETA: 0s - loss: 0.3076 - acc: 0.7337
    3000/3000 [==============================] - 59s - loss: 0.3076 - acc: 0.7337    
    Traceback (most recent call last):
      File "C:/Users/Def/PycharmProjects/KerasUkExpenditure/TweetParsing.py", line 140, in <module>
        (loss, acc) = model.fit_generator(generator(tokenizer=t, startIndex=startIndex,batchSize=amountOfData),
    TypeError: 'History' object is not iterable

    Process finished with exit code 1

I'm confused by "'History' object is not iterable". What does this mean?

This is the first time I've tried to do batch training and testing, and I'm not sure I've implemented it correctly, as most of the examples I've seen online are for images. Here is the code:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.preprocessing.text import Tokenizer
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt

import re

"""
Number of samples (out of the ~1 million available) to use; my 960M 2GB can only handle
about 30,000 or so at the moment, depending on the number of neurons in the
deep layer and the number of layers.
"""
maxSamples = 3000

#Load the CSV and get the correct columns
data = pd.read_csv("C:\\Users\\Def\\Desktop\\Sentiment Analysis Dataset1.csv")
dataX = pd.DataFrame()
dataY = pd.DataFrame()
dataY[['Sentiment']] = data[['Sentiment']]
dataX[['SentimentText']] = data[['SentimentText']]

dataY = dataY.iloc[0:maxSamples]
dataX = dataX.iloc[0:maxSamples]

testY = dataY.iloc[-1: -maxSamples]
testX = dataX.iloc[-1: -maxSamples]


"""
Here I filter the data and clean it up by removing @ tags and hyperlinks and
also any characters that are not alphanumeric; I then add it to the vec list
"""
def removeTagsAndLinks(dataframe):
    vec = []
    for x in dataframe.iterrows():
        #Removes Hyperlinks
        zero = re.sub("(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?", "", x[1].values[0])
        #Removes @ tags
        one = re.sub("@\\w+", '', zero)
        #keeps only alpha-numeric chars
        two = re.sub("\W+", ' ', one)
        vec.append(two)
    return vec

vec = removeTagsAndLinks(dataX)
xTest = removeTagsAndLinks(testX)
yTest = removeTagsAndLinks(testY)
"""
This loop looks for any Tweets shorter than 2 characters and, once found, writes the
index of that Tweet to an array so I can remove it from the DataFrame of sentiments and the
list of Tweets later
"""
indexOfBlankStrings = []
for index, string in enumerate(vec):
    if len(string) < 2:
        del vec[index]
        indexOfBlankStrings.append(index)

for row in indexOfBlankStrings:
    dataY.drop(row, axis=0, inplace=True)


"""
This makes a BOW model out of all the tweets, then creates a
vector for each of the tweets containing all the words from
the BOW model; each vector is the same size because the
network expects it
"""

def vectorise(tokenizer, list):
    tokenizer.fit_on_texts(list)
    return tokenizer.texts_to_matrix(list)

#Make BOW model and vectorise it
t = Tokenizer(lower=False, num_words=1000)
dim = vectorise(t, vec)

xTest = vectorise(t, xTest)

"""
Here I'm experimenting with multiple layers sized as the total
number of words in the vocabulary divided by successive powers of 2 - this
has given me quite accurate results compared to random guesses
at the number of neurons and the number of layers.
"""
l1 = int(len(dim[0]) / 4) #Too big for my GPU
l2 = int(len(dim[0]) / 8) #Too big for my GPU
l3 = int(len(dim[0]) / 16)
l4 = int(len(dim[0]) / 32)
l5 = int(len(dim[0]) / 64)
l6 = int(len(dim[0]) / 128)


#Make the model
model = Sequential()
model.add(Dense(l1, input_dim=dim.shape[1]))
model.add(Dropout(0.15))
model.add(Dense(l2))
model.add(Dense(l1))
model.add(Dense(l3))
model.add(Dropout(0.2))
model.add(Dense(l4))
model.add(Dense(1, activation='relu'))

#Compile the model
model.compile(optimizer='RMSProp', loss='binary_crossentropy', metrics=['acc'])

"""
This here will use multiple batches to train the model.
    startIndex:
        This is the starting index of the array from which you want to
        start training the network.
    dataRange:
        The number of elements used to train the network in each batch, so
        since dataRange = 1000 this means it goes from
        startIndex...dataRange, or 0...1000
    amountOfEpochs:
        This is kind of self-explanatory; the more epochs, the more the
        optimisation algorithm is supposed to learn, i.e. update its numbers
"""
amountOfEpochs = 10
dataRange = 1000
startIndex = 0

def generator(tokenizer, batchSize, totalSize=maxSamples, startIndex=0):
    f = tokenizer.texts_to_sequences(vec[startIndex:totalSize])
    l = np.asarray(dataY.iloc[startIndex:totalSize])
    while True:
        for i in range(1000, totalSize, batchSize):
            batch_features = tokenizer.sequences_to_matrix(f[startIndex: batchSize])
            batch_labels = l[startIndex: batchSize]
            yield batch_features, batch_labels

##This runs the model in batches, i.e. load a little, process it, then load a little more
for amountOfData in range(1000, maxSamples, 1000):
    #(loss, acc) = model.train_on_batch(x=dim[startIndex:amountOfData], y=np.asarray(dataY.iloc[startIndex:amountOfData]))
    (loss, acc) = model.fit_generator(generator(tokenizer=t, startIndex=startIndex,batchSize=amountOfData),
                                      steps_per_epoch=maxSamples, epochs=amountOfEpochs,
                                      validation_data=(np.array(xTest), np.array(yTest)))
    startIndex += 1000

The part towards the bottom is where I've tried to implement fit_generator() and make my own generator. I wanted to load, say, 75,000 maxSamples and then train the network 1000 samples at a time until it reaches the maxSamples var, which is why I've set up the range to do (0, maxSample, 1000), which I use in generator(). Was this the correct use?

I ask because my network is not using the validation data, and it seems to fit the data extremely quickly, which suggests overfitting or just a very small dataset. Am I iterating over all the maxSamples in the correct way, or am I just looping over the first iterations several times?

Thanks


Solution

  • The problem lies in this line:

    (loss, acc) = model.fit_generator(...)
    

    as fit_generator returns a single object of the keras.callbacks.History class. That's why you get this error: a single object is not iterable. In order to get the loss lists you need to retrieve them from the history field of this callback, which is a dictionary of the recorded losses:

    history = model.fit_generator(...)
    
    loss = history.history["loss"]
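
    Since the model in the question is compiled with metrics=['acc'], the same history.history dictionary should also contain the per-epoch accuracy, plus val_loss / val_acc when validation data is passed. As a minimal sketch (assuming those key names, which follow the metric names), the recorded curves could be inspected and plotted with matplotlib, which the question already imports as plt:

    import matplotlib.pyplot as plt  # already imported as plt in the question

    acc = history.history["acc"]                    # per-epoch training accuracy, since metrics=['acc']
    val_loss = history.history.get("val_loss", [])  # only recorded when validation data is passed

    # plot the per-epoch curves
    plt.plot(loss, label="train loss")
    if val_loss:
        plt.plot(val_loss, label="val loss")
    plt.xlabel("epoch")
    plt.legend()
    plt.show()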