My question is close in nature to Feature Columns Embedding lookup; however, I was unable to comment on the answer given there (not enough rep), and I think the answerer either did not fully understand the question, or the answer was not exactly what was asked.
The goal: to serve a custom Estimator which uses the Dataset API to feed in data. The task is NMT (seq2seq).
Estimator requires feature_columns as input for serving. My NMT task has only one feature, the input sentence to translate (or possibly each word in the sentence is a feature?). So I am unsure how to build a feature_column (and thus an embedding_column and finally an input_layer) using my input sentences as a feature that can be fed into an RNN (which expects an embedding lookup of shape [batch_size, max_sequence_len, embedding_dim]), which would finally allow me to serve the Estimator.
I am attempting to use a custom Estimator to feed a seq2seq-style NMT implementation. I need to be able to serve the model via tf-serving, which Estimators seem to make relatively easy.
However, I hit a roadblock with how to serve the model. From what I can tell, I need feature_columns that will serve as the input into the model.
The serving guide I found shows that you need an export_input_fn which uses a feature_spec, which in turn needs feature_column(s) as input. This makes sense; however, for my use case I do not have a bunch of (different) features. Instead I have input sentences (where each word is a feature) that need to be looked up via embeddings and used as features...
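For concreteness, here is a minimal sketch of that export path (TF 1.x), assuming a feature_columns list and a trained estimator already exist elsewhere:

import tensorflow as tf

# Build a parsing spec from the feature columns, then an input receiver
# that expects serialized tf.Example protos at serving time.
feature_spec = tf.feature_column.make_parse_example_spec(feature_columns)
export_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
estimator.export_savedmodel('export_dir', export_input_fn)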
So I know I need the input into my model to be feature_column(s). My input for NMT is simply a tensor of shape [batch_size, max_sequence_len] which is filled with the indices of the words from the sentences (e.g., for batch_size=1: [3, 17, 132, 2, 1, 0, ...], where each index should map to an embedding vector). Typically I would feed this into an embedding lookup via
# embs: [vocab_size, embedding_dim]; inputs: [batch_size, max_sequence_len]
embs = tf.get_variable('embedding', [vocab_size, embedding_dim])
emb_inputs = tf.nn.embedding_lookup(embs, inputs)  # [batch_size, max_sequence_len, embedding_dim]
and I would be good to go; I could feed this to an RNN as inputs and the rest is history, not a problem.
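For context, feeding this into the RNN would look something like the following TF 1.x sketch (num_units is an assumed hyperparameter):

cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
# emb_inputs: [batch_size, max_sequence_len, embedding_dim]
outputs, state = tf.nn.dynamic_rnn(cell, emb_inputs, dtype=tf.float32)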
BUT, this is where I hit the issue: I need to use feature_columns (so I can serve the model). The answer given to the question I mentioned at the beginning shows how to use embedding_column, but it suggests that the embedding should look up an entire sentence as one single feature, whereas traditionally you would look up each word in the sentence and get its embedding.
That answer shows 'how to implement a feature-column in a custom estimator', and indeed its 'Before' code is exactly right (as I wrote out): a tf.get_variable into a tf.nn.embedding_lookup. But the 'After' code, again, only takes in one feature (the entire sentence?).
I have verified this by using their code and feeding my data of shape [batch_size, max_seq_len] to tf.feature_column.categorical_column_with_identity, and the output tensor is [batch_size, embedding_dim].
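Roughly, the check I ran looks like this (TF 1.x; the 'sentence' key and the sizes are my own placeholders):

import tensorflow as tf

vocab_size, embedding_dim = 1000, 8
cat_col = tf.feature_column.categorical_column_with_identity(key='sentence', num_buckets=vocab_size)
emb_col = tf.feature_column.embedding_column(cat_col, dimension=embedding_dim)

# One batch of word indices, shape [batch_size=1, max_seq_len=6]
features = {'sentence': tf.constant([[3, 17, 132, 2, 1, 0]])}
net = tf.feature_column.input_layer(features, [emb_col])
print(net.shape)  # (1, 8), i.e. [batch_size, embedding_dim], no sequence axis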
Is the sequence information lost? Or does it simply get flattened? When I print the output, its size is (?, embedding_dim), where ? is typically my batch_size.
EDIT: I have verified the shape is [batch_size, embedding_dim]; it is not just flattened... So the sequence info is lost.
I'm guessing it must be treating the input as one single input feature, so the batch_size=1 example [3, 17, 132, 2, 1, 0, ...] (where each index maps to an embedding vector) maps to a single feature. (Presumably the embedding_column is combining the per-word embeddings, e.g. via its default combiner='mean', which would explain the collapsed shape.) That is not what is wanted: we want each index to map to an embedding, and the needed output is [batch_size, max_seq_len, embedding_dim].
It sounds like what I instead need is not one categorical_column_with_*, but max_seq_len of them (one for each word in my sequence). Does this sound right? Each word would be a feature for my model, so I am leaning toward this being the correct approach, but it also has issues. I am using the Dataset API, so in my input_train_fn() I load my data from a file and then use tf.data.Dataset.from_tensor_slices((data, labels)) to split the data into tensors, which I can then dataset.batch(batch_size).make_one_shot_iterator().get_next() to feed into my Estimator (see the sketch below). I cannot iterate over each batch (Tensors are not iterable), so I cannot simply make 100 feature_columns for each input batch...
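For reference, my input_train_fn() is roughly this sketch (argument names are my own):

import tensorflow as tf

def input_train_fn(data, labels, batch_size):
    # data: [num_examples, max_seq_len] word indices; labels are shaped similarly
    dataset = tf.data.Dataset.from_tensor_slices((data, labels))
    return dataset.batch(batch_size).make_one_shot_iterator().get_next()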
Does anyone have any idea how to do this? This embedding lookup is a very straightforward thing to do with simple placeholders or variables (and a common approach in NLP tasks), but when I venture into the Dataset API and Estimators I run into a wall, with very little information out there (that is not a basic example).
I admit I may have gaps in my understanding; custom Estimators and the Dataset API are new to me, and finding information on them can be difficult at times. So feel free to fire off information at me.
Thanks for reading my wall of text and hopefully helping me (and the others I've seen ask a question similar to this but get no answer: https://groups.google.com/a/tensorflow.org/forum/#!searchin/discuss/embeddings$20in$20custom$20estimator/discuss/U3vFQF_jeaY/EjgwRQ3RDQAJ). I feel bad for this guy; his question was not really answered (for the same reasons outlined here), and his thread got hijacked...
So the way I ended up making this work: I made each word an input feature, then I simply do the wrd_2_idx conversion, pass the indices in as features in numeric_columns (you have max_seq_len of these), and then pass those columns to input_layer. Then in my graph I use these features and look up the embedding as normal, basically circumventing the embedding_column lookup, since I can't figure out how to make it act the way I want. This is probably not optimal, but it works and trains...
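In case it helps anyone, here is a rough sketch of that workaround (TF 1.x; the key names and sizes are my own placeholders, not my exact code):

import tensorflow as tf

max_seq_len, vocab_size, embedding_dim = 10, 1000, 64

# One numeric column per word position; zero-padded keys so that
# input_layer (which orders columns by name) preserves word order.
cols = [tf.feature_column.numeric_column(key='w%02d' % i) for i in range(max_seq_len)]

# Features as the input_fn hands them over: one word index per key.
features = {'w%02d' % i: tf.constant([3.0]) for i in range(max_seq_len)}

# input_layer concatenates the columns back into [batch_size, max_seq_len].
ids = tf.cast(tf.feature_column.input_layer(features, cols), tf.int32)

# Then look up the embeddings as normal, bypassing embedding_column.
embs = tf.get_variable('embedding', [vocab_size, embedding_dim])
emb_inputs = tf.nn.embedding_lookup(embs, ids)
print(emb_inputs.shape)  # (1, 10, 64), i.e. [batch_size, max_seq_len, embedding_dim]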
I'll leave this as the accepted answer and hope that sometime in the future either I figure out a better way to do it, or someone else can enlighten me as to the best way to approach this.