[Image: bags and how to choose from them]
Imagine I have 10 bags, ordered one after the other, i.e. Bag 1, Bag 2, ..., Bag n.
Each bag has a distinct set of words.
In order to understand what a bag is, consider a vocabulary of 10,000 words. The first bag contains the words Hello, India, Manager.
That is, Bag 1 will have 1's at the indices of the words present in the bag. For example, Bag 1 will be of size 10000*1; if Hello's index was 1, India's index was 2 and Manager's was 4, it will be [0, 1, 1, 0, 1, 0, 0, 0, 0, ...]
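As a sanity check, a bag like that can be built with NumPy (the vocabulary size and word indices below are just the example numbers from the question):

```python
import numpy as np

vocab_size = 10000
bag1_word_indices = [1, 2, 4]  # Hello, India, Manager (example indices)

# multi-hot vector: 1 at each word's index, 0 everywhere else
bag1 = np.zeros(vocab_size, dtype=np.float32)
bag1[bag1_word_indices] = 1.0

print(bag1[:10])  # [0. 1. 1. 0. 1. 0. 0. 0. 0. 0.]
```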
I don't have a model yet. I'm thinking of using story books, but it's still kind of abstract for me.
A word has to be chosen from each bag and assigned a number: word 1 (word from bag 1), word 2 (word from bag 2), and so on. In their numerical order, they must form a MEANINGFUL sentence!
Have a database with thousands/millions of valid sentences.
Create a dictionary where each word represents a number (reserve 0 for "nothing", 1 for "start of sentence" and 2 for "end of sentence").
word_dic = { "_nothing_": 0, "_start_": 1, "_end_": 2, "word1": 3, "word2": 4, ...}
reverse_dic = {v:k for k,v in word_dic.items()}
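A minimal sketch of building that dictionary from the database (the toy sentence list here is made up for illustration):

```python
# toy database of tokenized sentences (illustrative only)
sentences_db = [["hello", "india"], ["india", "has", "a", "manager"]]

# reserved indices first, then each new word gets the next free index
word_dic = {"_nothing_": 0, "_start_": 1, "_end_": 2}
for sentence in sentences_db:
    for word in sentence:
        if word not in word_dic:
            word_dic[word] = len(word_dic)

reverse_dic = {v: k for k, v in word_dic.items()}

print(word_dic["hello"])  # 3
print(reverse_dic[4])     # india
```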
Remember to add "_start_" and "_end_" at the beginning and end of all sentences in the database, and "_nothing_" after the end to pad each sentence to a fixed length capable of containing all sentences. (Ideally, work with sentences of 10 or fewer words, so your model won't try to create bigger sentences.)
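The padding step could look like this (the function name and the fixed length of 12 — 10 words plus the two markers — are my choices):

```python
max_words = 10
length = max_words + 2  # room for _start_ and _end_

def pad_sentence(words, length=length):
    # wrap the sentence with markers, then pad with _nothing_ to the fixed length
    padded = ["_start_"] + list(words) + ["_end_"]
    padded += ["_nothing_"] * (length - len(padded))
    return padded

print(pad_sentence(["hello", "india"]))
# ['_start_', 'hello', 'india', '_end_', '_nothing_', '_nothing_', ...]
```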
Transform all your sentences into sequences of indices:
#supposing you have an array of shape (sentences, length) as strings:
import numpy as np

indices = []
for word in database.reshape((-1,)):
    indices.append(word_dic[word])
indices = np.array(indices).reshape((sentences, length))
Transform this into categorical words with the keras function to_categorical()
cat_sentences = to_categorical(indices) #shape (sentences,length,dictionary_size)
Hint: keras has lots of useful text preprocessing functions here.
Separate training input and output data:
#input is the sentences except for the last word
x_train = cat_sentences[:,:-1,:]
#output is the sentences except for the first word (shifted one step ahead)
y_train = cat_sentences[:,1:,:]
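This one-step shift is easy to verify on a tiny array (shapes are illustrative):

```python
import numpy as np

# 2 sentences, length 4, dictionary of 3 words, already one-hot
cat_sentences = np.eye(3)[np.array([[1, 2, 0, 0],
                                    [2, 1, 2, 0]])]  # shape (2, 4, 3)

x_train = cat_sentences[:, :-1, :]  # all words except the last
y_train = cat_sentences[:, 1:, :]   # all words except the first

print(x_train.shape, y_train.shape)  # (2, 3, 3) (2, 3, 3)
# at each timestep t, y_train[:, t] is the word that follows x_train[:, t]
```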
Let's create an LSTM based model that will predict the next words from the previous words:
model = Sequential()
model.add(LSTM(units, return_sequences=True, input_shape=(None, dictionary_size)))  # pick the number of units
model.add(.....)  # optionally more layers
model.add(LSTM(dictionary_size, return_sequences=True, activation='sigmoid'))
#or a Dense(dictionary_size, activation='sigmoid') applied to each timestep
Compile and fit this model with x_train and y_train:
model.compile(....)
model.fit(x_train,y_train,....)
Create an identical model using stateful=True in all LSTM layers:
newModel = ......
Transfer the weights from the trained model:
newModel.set_weights(model.get_weights())
Create your bags in a categorical way, shape (10, dictionary_size).
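Building the bags in that categorical shape, reusing the word_dic indices (the word indices and dictionary size here are made up):

```python
import numpy as np

dictionary_size = 8          # illustrative
bags_words = [[3, 4], [5, 6, 7]]  # word indices per bag, taken from word_dic

bags = np.zeros((len(bags_words), dictionary_size), dtype=np.float32)
for i, word_indices in enumerate(bags_words):
    bags[i, word_indices] = 1.0  # mark every word available in bag i

print(bags.shape)  # (2, 8)
```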
Use the model to predict one word from the _start_ word.
#reset the states of the stateful model before you start a 10-word prediction:
newModel.reset_states()
firstWord = newModel.predict(startWord) #startWord is shaped as (1,1,dictionary_size)
The firstWord will be a vector of size dictionary_size telling (sort of) the probabilities of each existing word. Compare it to the words in the bag. You can choose the highest probability, or use some random selection if the probabilities of other words in the bag are also good.
#example taking the most probable word:
firstWord = np.array(firstWord == firstWord.max(), dtype=np.float32)
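Since the raw argmax may land on a word that is not in the bag, you can mask the prediction with the bag vector before taking the maximum (a sketch with made-up probabilities):

```python
import numpy as np

dictionary_size = 6
# pretend model output: probabilities over the dictionary (made up)
pred = np.array([[[0.05, 0.1, 0.05, 0.4, 0.3, 0.1]]])  # shape (1, 1, 6)

bag = np.zeros(dictionary_size, dtype=np.float32)
bag[[1, 4]] = 1.0  # only words 1 and 4 are in this bag

masked = pred * bag     # zero out every word not in the bag
best = masked.argmax()  # highest-probability word *within the bag*
print(best)             # 4

# one-hot it to feed back into the model, as in the snippet above
firstWord = np.array(masked == masked.max(), dtype=np.float32)
```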
Do the same again, but now input firstWord into the model:
secondWord = newModel.predict(firstWord) #respect the shapes
Repeat the process until you get a sentence. Notice that you may find _end_ before a word has been taken from all 10 bags. You may decide to finish the process with a shorter sentence then, especially if the other word probabilities are low.
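Putting the whole loop together; fake_predict below is a scripted stand-in for newModel.predict so this sketch runs without a trained model — swap in the real stateful model in practice:

```python
import numpy as np

dictionary_size = 6
END = 2  # index of _end_ in word_dic

# stand-in for newModel.predict, scripted for illustration only
script = iter([3, 4, 5])
def fake_predict(x):
    out = np.full((1, 1, dictionary_size), 0.01)
    out[0, 0, next(script)] = 0.9
    return out

# one word per bag here, for simplicity (made-up indices)
bags = np.zeros((3, dictionary_size), dtype=np.float32)
bags[0, 3] = bags[1, 4] = bags[2, 5] = 1.0

current = np.zeros((1, 1, dictionary_size), dtype=np.float32)
current[0, 0, 1] = 1.0  # _start_

sentence = []
for bag in bags:                     # one prediction per bag, in order
    pred = fake_predict(current)     # newModel.predict(current) in practice
    if pred.argmax() == END:         # the model wants to stop early
        break
    chosen = int((pred * bag).argmax())  # best word *available in this bag*
    sentence.append(chosen)
    current = np.zeros_like(current)     # feed the chosen word back, one-hot
    current[0, 0, chosen] = 1.0

print(sentence)  # [3, 4, 5]
```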