python, tensorflow, neural-network, chatbot, tflearn

How to create a validation_set for TFLearn?


I'm trying to create a validation_set for this chatbot tutorial: Contextual Chatbots with Tensorflow

But I'm having issues with the shape of my data. This is the method I'm using to create both my training and validation sets:

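# setup assumed from the tutorial (not shown in the original post)
import random

import nltk
import numpy as np
from nltk.stem.lancaster import LancasterStemmer

stemmer = LancasterStemmer()
# `intents` is the dict loaded from the tutorial's intents JSON file

words = []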
classes = []
documents = []
ignore_words = ['?']
# loop through each sentence in our intents patterns
for intent in intents['intents']:
    for pattern in intent['patterns']:
        # tokenize each word in the sentence
        w = nltk.word_tokenize(pattern)
        # add to our words list
        words.extend(w)
        # add to documents in our corpus
        documents.append((w, intent['tag']))
        # add to our classes list
        if intent['tag'] not in classes:
            classes.append(intent['tag'])

# stem and lower each word and remove duplicates
words = [stemmer.stem(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))

# remove duplicates
classes = sorted(list(set(classes)))


# create our training data
training = []
output = []
# create an empty array for our output
output_empty = [0] * len(classes)

# training set, bag of words for each sentence
for doc in documents:
    # initialize our bag of words
    bag = []
    # list of tokenized words for the pattern
    pattern_words = doc[0]
    # stem each word
    pattern_words = [stemmer.stem(word.lower()) for word in pattern_words]
    # create our bag of words array
    for w in words:
        bag.append(1 if w in pattern_words else 0)

    # output is a '0' for each tag and '1' for current tag
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1

    training.append([bag, output_row])

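# shuffle our features and turn into np.array
# (dtype=object because bag and output_row differ in length; recent NumPy rejects ragged input otherwise)
random.shuffle(training)
training = np.array(training, dtype=object)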

# split into feature (x) and label (y) lists
x = list(training[:,0])
y = list(training[:,1])

I run this twice with different data to get my training and validation sets. The problem is that I define the network's input layer with the shape of my training set:

net = tflearn.input_data(shape=[None, len(train_x[0])])

So when I go to fit the model:

model.fit(train_x, train_y, n_epoch=1000, snapshot_step=100, snapshot_epoch=False, validation_set=(val_x, val_y), show_metric=True)

I get this error:

ValueError: Cannot feed value of shape (23, 55) for Tensor 'InputData/X:0', which has shape '(?, 84)'

Here 23 is the number of questions and 55 is the number of unique words in my validation set, while 84 is the number of unique words in the training set.

Because my validation set has a different number of questions and unique words than my training set, I can't validate my training.

Can someone help me create a valid validation set that is independent of the number of questions? I'm new to TensorFlow and TFLearn, so any help would be great.


Solution

  • To the best of my understanding, this is what you did: you created a dictionary called words that contains all possible words in a dataset. Then, while creating the training dataset, you looked up each word of a question in that dictionary and, if it was there, added 1 to your bag of words, and 0 otherwise. The issue is that each question has a different number of words, and hence a different number of 1's and 0's.

    You can get around this by doing the reverse: for each word in the dictionary words, check whether it appears in the training question; if it does, add 1 to your bag of words, and 0 otherwise. This way every question becomes a vector of the same length (= the length of the dictionary words). Your training set will then have dimensions (num_of_questions_in_training, len(words)).

    The same can be done for the validation set, using the same dictionary words: for each word in the dictionary, check whether it appears in the validation question; if it does, add 1 to your bag of words, and 0 otherwise. Your validation set will then have dimensions (num_of_questions_in_validation, len(words)), which solves your dimension mismatch.

    So, assuming there are 90 words in words, the shapes would be training_set_shape: (?, 90) and validation_set_shape: (23, 90).
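
The idea above can be sketched as follows. This is a minimal illustration, not the tutorial's code: the helper names build_vocabulary and bag_of_words and the toy sentences are hypothetical, and plain whitespace splitting stands in for nltk.word_tokenize plus stemming. The point it tries to show is that the vocabulary is built once, from the training patterns only, and then reused to encode the validation patterns, so both sets end up with the same number of columns.

```python
def build_vocabulary(token_lists, ignore_words=("?",)):
    """Collect the sorted set of lowercased tokens seen in the given patterns."""
    vocab = {tok.lower() for tokens in token_lists for tok in tokens
             if tok not in ignore_words}
    return sorted(vocab)


def bag_of_words(tokens, vocabulary):
    """Return one 0/1 entry per vocabulary word, so the length is always len(vocabulary)."""
    present = {tok.lower() for tok in tokens}
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical toy data; str.split() is a stand-in for tokenizing and stemming.
train_patterns = ["how are you", "what are your hours", "do you take credit cards"]
val_patterns = ["are you open today", "which cards do you accept"]

train_tokens = [p.split() for p in train_patterns]
val_tokens = [p.split() for p in val_patterns]

# Build the vocabulary ONCE, from the training set only.
words = build_vocabulary(train_tokens)

# Encode both sets against that same vocabulary.
train_x = [bag_of_words(t, words) for t in train_tokens]
val_x = [bag_of_words(t, words) for t in val_tokens]

print(len(train_x), len(train_x[0]))  # rows, columns of the training set
print(len(val_x), len(val_x[0]))      # rows, columns of the validation set
```

Validation words that never occurred in training (here "open", "today", "which", "accept") simply have no column and are ignored. The same reasoning applies to the labels: val_y would need to be one-hot encoded against the classes list built from the training data, so that its width matches the network's output layer.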