CNTK: How do I initialize LSTM hidden state?

I'm trying to convert a working image captioning CNN-LSTM network from TensorFlow to CNTK. I think I have a correctly trained model, but I can't figure out how to extract predictions from the final trained CNTK model.

This is my CNTK model:

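import cntk as C
from cntk.layers import (BatchNormalization, Dense, Embedding, LSTM,
                         Recurrence, Sequential, Stabilizer)

def create_lstm_model(image_features, text_features):
    embedding_dim = 512
    hidden_dim = 512
    cell_dim = 512
    vocab_dim = 77

    # Separate embeddings project image features and word tokens to the same width.
    image_embedding = Embedding(embedding_dim)
    text_embedding = Embedding(embedding_dim)
    # Two stacked LSTM layers followed by a projection onto the vocabulary.
    lstm_classifier = Sequential([Stabilizer(),
                                  Recurrence(LSTM(hidden_dim)),
                                  Recurrence(LSTM(hidden_dim)),
                                  Stabilizer(),
                                  Dense(vocab_dim)])

    embedded_images = BatchNormalization()(image_embedding(image_features))
    embedded_text = text_embedding(text_features)
    # Element-wise sum of the two embedded sequences (see the explanation below).
    lstm_input = C.plus(embedded_images, embedded_text)
    lstm_input = C.dropout(lstm_input, 0.5)
    output = lstm_classifier(lstm_input)

    return output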

I'm providing my data in CTF format, with fixed caption sequence sizes of 40, using this structure:

from cntk.io import (CTFDeserializer, MinibatchSource, StreamDef, StreamDefs,
                     INFINITELY_REPEAT)

# vocab_len and image_features_dim are defined elsewhere in my script.
def create_reader(path, is_training):
    return MinibatchSource(CTFDeserializer(path, StreamDefs(
        target_tokens = StreamDef(field='target_tokens', shape=vocab_len, is_sparse=True),
        input_tokens = StreamDef(field='input_tokens', shape=vocab_len, is_sparse=True),
        image_features = StreamDef(field='image_features', shape=image_features_dim, is_sparse=False)
    )), randomize = is_training, max_sweeps = INFINITELY_REPEAT if is_training else 1)

Aside, on why there are three data streams: I have an input image feature vector (the last 2048-dim layer of a pre-trained ResNet), a sequence of input text tokens, and a sequence of output text tokens. So my CTF file, in terms of sequences, looks like:

0 | target_token_0  | input_token_0  | input_image_feature_vector (2048-dim)
0 | target_token_1  | input_token_1  | empty array of 2048 zeros
0 | target_token_2  | input_token_2  | empty array of 2048 zeros
...
0 | target_token_39 | input_token_39 | empty array of 2048 zeros
1 | target_token_0  | input_token_0  | input_image_feature_vector (2048-dim)
1 | target_token_1  | input_token_1  | empty array of 2048 zeros
1 | target_token_2  | input_token_2  | empty array of 2048 zeros
...
1 | target_token_39 | input_token_39 | empty array of 2048 zeros
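To make the layout above concrete, here is a minimal sketch of how one caption could be written out as CTF text lines. The helper name `ctf_sequence_lines` is hypothetical, and the toy sizes (three tokens, a 2-dim image vector) are assumptions for illustration; the stream names match the `StreamDefs` in `create_reader`. Sparse one-hot tokens are written as `index:1`, and the dense image vector is written in full on the first step and as zeros afterwards.

```python
def ctf_sequence_lines(seq_id, input_ids, target_ids, image_vec):
    """Format one caption as CTF text lines, one line per time step.

    Hypothetical helper: only the first step carries the real image
    features; every later step carries a zero vector of the same width.
    """
    if len(input_ids) != len(target_ids):
        raise ValueError("input and target sequences must have equal length")
    zeros = [0.0] * len(image_vec)
    lines = []
    for step, (inp, tgt) in enumerate(zip(input_ids, target_ids)):
        feats = image_vec if step == 0 else zeros
        dense = " ".join(format(v, "g") for v in feats)
        lines.append(
            f"{seq_id} |target_tokens {tgt}:1 |input_tokens {inp}:1 "
            f"|image_features {dense}"
        )
    return lines


# Toy example: a 3-token caption and a 2-dim "image feature" vector.
example = ctf_sequence_lines(0, input_ids=[1, 5, 9], target_ids=[5, 9, 2],
                             image_vec=[0.25, 0.5])
for line in example:
    print(line)
```

All lines of one caption share the same leading sequence ID, which is how the reader groups them into a single sequence.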

Basically, I couldn't figure out how to slice & splice two sequences together in CNTK (even though you can splice two tensors easily), so I'm hacking around it by providing only the first element in a sequence with an input 2048-dim image feature vector, and the remaining elements in a sequence with an empty 2048-dim vector of zeros - setup for:

C.plus(embedded_images, embedded_text)

in the model above, where the goal is essentially to take the first element of a sequence of 40 [2048]->[512] image embeddings and hack-splice(TM) it in front of the last 39 elements of a sequence of 40 [vocab_dim]->[512] word embeddings. I'm counting on near-zero [2048]->[512] image embeddings being learned for the empty image vectors (2048 zeros), so I take my embedded image sequence and add it element-wise to my embedded text sequence before everything goes into the LSTM. Basically, this:

image embedding sequence: [-1, 40, 512]  (e.g., [-1, 0, 512])
text embedding sequence:  [-1, 40, 512]  (e.g., [-1, 1:40, 512])
+
---------------------------------------
lstm input sequence:      [-1, 40, 512]
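The element-wise addition can be illustrated in plain Python with toy sizes (3 time steps and width 2 instead of 40 and 512). This is only a sketch of the arithmetic, not CNTK code, and it assumes the embedded all-zero image rows come out as exact zeros; the model in the post does not guarantee that, since the image embedding passes through `BatchNormalization` and is only expected to be "pretty empty".

```python
def add_sequences(image_seq, text_seq):
    """Element-wise sum of two equal-length sequences of equal-width vectors."""
    if len(image_seq) != len(text_seq):
        raise ValueError("sequences must have the same length")
    return [[a + b for a, b in zip(img, txt)]
            for img, txt in zip(image_seq, text_seq)]


# Toy stand-ins for the embedded sequences (assumed values, for illustration).
image_seq = [[0.5, -0.5], [0.0, 0.0], [0.0, 0.0]]  # image present only at step 0
text_seq = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]    # one embedding per input token

lstm_input = add_sequences(image_seq, text_seq)
print(lstm_input)
```

Note that step 0 of the result is the sum of the image embedding and the first token's embedding, so the image only stands alone at that step if the first token's embedding is itself close to zero; later steps are the text embeddings unchanged.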

Which brings me to my actual question. Now that I have a model that trains decently well, I'd like to extract caption predictions from the model, basically doing something like this (from this PyTorch image captioning tutorial):

def sample(self, features, states=None):
    """Samples captions for given image features (Greedy search)."""
    sampled_ids = []
    inputs = features.unsqueeze(1)
    for i in range(20):                                      # maximum sampling length
        hiddens, states = self.lstm(inputs, states)          # (batch_size, 1, hidden_size), 
        outputs = self.linear(hiddens.squeeze(1))            # (batch_size, vocab_size)
        predicted = outputs.max(1)[1]
        sampled_ids.append(predicted)
        inputs = self.embed(predicted)
        inputs = inputs.unsqueeze(1)                         # (batch_size, 1, embed_size)
    sampled_ids = torch.cat(sampled_ids, 1)                  # (batch_size, 20)
    return sampled_ids.squeeze()

The problem is that I can't figure out the CNTK equivalent of getting the hidden state out of an LSTM and feeding it back in at the next time step:

hiddens, states = self.lstm(inputs, states)

How does this work in CNTK?
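For reference, the pattern being asked about can be written without any framework. The sketch below is not CNTK API: `step` and `embed` are hypothetical stand-ins for "one LSTM step plus output projection" and "word embedding", and the toy versions exist only so the loop can be run and checked. The point is the data flow: each iteration consumes the previous state and the previous prediction, and returns a new state.

```python
def greedy_decode(step, first_input, embed, initial_state=None,
                  max_len=20, end_id=None):
    """Greedy search that threads the recurrent state through every step.

    step(x, state) -> (scores, new_state)   hypothetical single-step model
    embed(token_id) -> x                    hypothetical embedding lookup
    """
    tokens = []
    x, state = first_input, initial_state
    for _ in range(max_len):
        scores, state = step(x, state)       # state from step t feeds step t+1
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        tokens.append(token)
        if token == end_id:                  # optional end-of-caption token
            break
        x = embed(token)                     # prediction becomes the next input
    return tokens


# Toy stand-ins: the "state" is a step counter, and the scores favour
# token (counter + input) % 4, which makes the loop easy to trace by hand.
def toy_step(x, state):
    count = 0 if state is None else state
    best = (count + x) % 4
    scores = [1.0 if i == best else 0.0 for i in range(4)]
    return scores, count + 1


def toy_embed(token):
    return token  # identity "embedding"


caption = greedy_decode(toy_step, first_input=1, embed=toy_embed, max_len=5)
print(caption)
```

In the captioning setting, `first_input` would be the embedded image features and `initial_state` would be empty, mirroring `states=None` in the PyTorch snippet.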

Solution

I think the function you are looking for is RecurrenceFrom(). Its documentation contains the following example:

Example:
 >>> from cntk.layers import *
 >>> from cntk.layers.typing import *

 >>> # a plain sequence-to-sequence model in training (where label length is known)
 >>> en = C.input_variable(**SequenceOver[Axis('m')][SparseTensor[20000]])  # English input sentence
 >>> fr = C.input_variable(**SequenceOver[Axis('n')][SparseTensor[30000]])  # French target sentence

 >>> embed = Embedding(300)
 >>> encoder = Recurrence(LSTM(500), return_full_state=True)
 >>> decoder = RecurrenceFrom(LSTM(500))       # decoder starts from a data-dependent initial state, hence -From()
 >>> emit = Dense(30000)
 >>> h, c = encoder(embed(en)).outputs         # LSTM encoder has two outputs (h, c)
 >>> z = emit(decoder(h, c, sequence.past_value(fr)))   # decoder takes encoder outputs as initial state
 >>> loss = C.cross_entropy_with_softmax(z, fr)