Tags: deep-learning, keras, nltk, language-model, google-natural-language

How to build a deep learning model that picks words from several distinct bags and forms a meaningful sentence


[Image: the bags and how to choose words from them]

Imagine I have 10 bags, ordered one after the other, i.e. Bag 1, Bag 2, ..., Bag n.

Each bag has a distinct set of words.

To understand what a bag is, consider a vocabulary of 10,000 words. The first bag contains the words Hello, India, and Manager.

That is, Bag 1 will have 1's at the indices of the words present in the bag. For example, Bag 1 will be a vector of size 10000×1; if Hello's index is 1, India's index is 2, and Manager's is 4, it will be [0, 1, 1, 0, 1, 0, 0, 0, 0, ...].

  • I don't have a model yet.

  • I'm thinking of using story books, but it's still kind of abstract for me.

A word has to be chosen from each bag and assigned a number: word 1 (word from bag 1), word 2 (word from bag 2), and so on, and the words must form a MEANINGFUL sentence in their numerical order.


Solution

  • Have a database with thousands/millions of valid sentences.

    Create a dictionary where each word maps to a number (reserve 0 for "nothing", 1 for "start of sentence", and 2 for "end of sentence").

    word_dic = { "_nothing_": 0, "_start_": 1, "_end_": 2, "word1": 3, "word2": 4, ...}
    reverse_dic = {v:k for k,v in word_dic.items()}
    
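    A minimal, illustrative way to build such a dictionary from a small corpus (the toy corpus and the loop below are my assumptions, not part of the original answer):

    ```python
    # Illustrative toy corpus; in practice this would be your sentence database.
    corpus = ["hello india", "hello manager"]

    # Reserve the three special indices, then assign the next free index
    # to each new word as it first appears.
    word_dic = {"_nothing_": 0, "_start_": 1, "_end_": 2}
    for sentence in corpus:
        for word in sentence.split():
            if word not in word_dic:
                word_dic[word] = len(word_dic)

    reverse_dic = {v: k for k, v in word_dic.items()}
    ```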

    Remember to add "_start_" and "_end_" at the beginning and end of every sentence in the database, and pad with "_nothing_" after the end so that all sentences reach one fixed length capable of containing the longest sentence. (Ideally, work with sentences of 10 words or fewer, so your model won't try to create longer sentences.)
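    A minimal sketch of that padding step, assuming already-tokenized sentences and a fixed length of 10 (the function name `pad_sentence` is mine, not from the answer):

    ```python
    LENGTH = 10  # fixed length, including the _start_ and _end_ markers

    def pad_sentence(tokens, length=LENGTH):
        # Wrap the sentence with the special markers, then pad with _nothing_.
        padded = ["_start_"] + tokens + ["_end_"]
        padded += ["_nothing_"] * (length - len(padded))
        return padded
    ```

    For example, `pad_sentence(["hello", "india"])` yields a 10-token list that starts with `_start_`, ends the sentence with `_end_`, and fills the rest with `_nothing_`.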

    Transform all your sentences into sequences of indices:

     import numpy as np

     #supposing `database` is an array of shape (sentences, length) of strings:
     indices = []
     for word in database.reshape((-1,)):
         indices.append(word_dic[word])
     indices = np.array(indices).reshape((sentences, length))
    

    Transform this into categorical words with the keras function to_categorical()

     cat_sentences = to_categorical(indices) #shape (sentences,length,dictionary_size)
    

    Hint: Keras has lots of useful text preprocessing functions in its keras.preprocessing.text module.

    Separate training input and output data:

    #input is the sentences except for the last word
    x_train = cat_sentences[:,:-1,:]
    #output is the same sentences shifted one word ahead (the next-word targets)
    y_train = cat_sentences[:,1:,:]
    

    Let's create an LSTM-based model that will predict the next word from the previous words:

    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    model = Sequential()
    model.add(LSTM(dontKnow,return_sequences=True,input_shape=(None,dictionary_size)))
       #dontKnow = number of units, your choice
    model.add(.....)
    model.add(LSTM(dictionary_size,return_sequences=True,activation='sigmoid')) 
       #or a Dense(dictionary_size,activation='sigmoid')
    

    Compile and fit this model with x_train and y_train:

    model.compile(....)
    model.fit(x_train,y_train,....)
    

    Create an identical model using stateful=True in all LSTM layers:

    newModel = ...... 
    

    Transfer the weights from the trained model:

    newModel.set_weights(model.get_weights())
    

    Create your bags in a categorical way, shape (10, dictionary_size).
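    A sketch of building those multi-hot bag vectors with NumPy (the names `bags_of_words` and the toy `dictionary_size` below are my assumptions):

    ```python
    import numpy as np

    dictionary_size = 6  # toy size; normally len(word_dic)
    word_dic = {"_nothing_": 0, "_start_": 1, "_end_": 2,
                "hello": 3, "india": 4, "manager": 5}

    # One bag per sentence position; only one bag shown here.
    bags_of_words = [["hello", "india", "manager"]]

    bags = np.zeros((len(bags_of_words), dictionary_size), dtype=np.float32)
    for i, bag in enumerate(bags_of_words):
        for word in bag:
            bags[i, word_dic[word]] = 1.0  # set 1 at each bag word's index
    ```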

    Use the model to predict one word from the _start_ word.

    #reset the states of the stateful model before you start a 10 word prediction:
    newModel.reset_states()
    
    firstWord = newModel.predict(startWord) #startWord is shaped as (1,1,dictionary_size)
    

    The firstWord will be a vector of size dictionary_size giving (roughly) the probability of each word in the dictionary. Compare it to the words in the bag. You can choose the word with the highest probability, or use some random selection if the probabilities of other words in the bag are also good.

    #example taking the most probable word:
    firstWord = np.array(firstWord == firstWord.max(), dtype=np.float32)
    
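    To restrict the choice to the current bag rather than the whole dictionary, one option is to mask the probability vector with the bag's multi-hot vector before taking the maximum (a sketch with made-up numbers; `probs` stands in for the model's prediction):

    ```python
    import numpy as np

    probs = np.array([0.05, 0.1, 0.05, 0.6, 0.15, 0.05])  # model output, shape (dictionary_size,)
    bag   = np.array([0., 0., 0., 0., 1., 1.])            # this bag holds words 4 and 5 only

    in_bag = probs * bag                  # zero out words outside the bag
    chosen = int(np.argmax(in_bag))       # index of the most probable in-bag word
    one_hot = np.array(in_bag == in_bag.max(), dtype=np.float32)
    ```

    Note that word 3 has the highest overall probability, but it is not in this bag, so word 4 is chosen.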

    Do the same again, but now input firstWord in the model:

    secondWord = newModel.predict(firstWord) #respect the shapes
    

    Repeat the process until you get a full sentence. Notice that you may encounter _end_ before all 10 bags have been used. You may decide to finish the process with a shorter sentence then, especially if the other word probabilities are low.
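    The whole prediction loop can be sketched as below. To keep the sketch self-contained, the stateful model's predict is replaced by a generic `predict_fn` callable; with the real model you would pass `newModel.predict` (after calling `newModel.reset_states()`). The function and argument names are mine, not from the answer:

    ```python
    import numpy as np

    def generate(predict_fn, bags, start_index=1, end_index=2):
        # Pick one word per bag, feeding each choice back into the model.
        dictionary_size = bags.shape[1]
        current = np.zeros((1, 1, dictionary_size), dtype=np.float32)
        current[0, 0, start_index] = 1.0          # begin from _start_
        sentence = []
        for bag in bags:
            probs = predict_fn(current)[0, 0]     # shape (dictionary_size,)
            if int(np.argmax(probs)) == end_index:
                break                             # model predicted _end_ early
            chosen = int(np.argmax(probs * bag))  # most probable in-bag word
            sentence.append(chosen)
            current = np.zeros_like(current)
            current[0, 0, chosen] = 1.0           # feed the chosen word back in
        return sentence
    ```

    The returned indices can be mapped back to words with reverse_dic.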