I am struggling with how to adapt the ML Engine examples to use a long text string as an input feature. I am building a custom estimator (like this example) and am looking to understand "best practices" with an eye towards deploying the model: I would like as many transformations as possible to be contained within the model_fn itself so that things are easier at serving time.

Consider a .csv with a column "text" that contains sentences such as "This is a sentence". Ultimately I need to split this text into tokens using tf.string_split(), convert the individual tokens to indices (using a vocabulary file or similar), and then pass those indices to an embedding. Assuming "This is a sentence" is the sentence we are working with, one approach is outlined below. My question is whether this is the "optimal" way to accomplish this, or if there is a better way.
Feature Columns
def get_feature_columns():
    sparse = tf.contrib.layers.sparse_column_with_integerized_feature(
        column_name='text',
        bucket_size=100000,
        combiner="sum")
    embedding = tf.contrib.layers.embedding_column(
        sparse, dimension=word_embedding_size)
    return set([embedding])
Input Function
def generate_input_fn():
    # Read rows from the .csv files.
    filename_queue = tf.train.string_input_producer(
        filenames, num_epochs=num_epochs, shuffle=shuffle, capacity=32)
    reader = tf.TextLineReader(skip_header_lines=skip_header_lines)
    _, rows = reader.read_up_to(filename_queue, num_records=batch_size)
    text, label = tf.decode_csv(rows, record_defaults=[[""], [""]])

    # Transform text from sentence --> tokens --> integerized tokens.
    text_index = tf.contrib.lookup.index_table_from_file(vocabulary_file=vocab_file)
    tokens = tf.string_split(text)
    tokens_idx = text_index.lookup(tokens)

    features = dict(zip(['text', 'label'], [tokens_idx, label]))
    # enqueue_many=True because read_up_to already yields a batch of rows;
    # capacity and min_after_dequeue are required by shuffle_batch.
    features = tf.train.shuffle_batch(
        features, batch_size, capacity=batch_size * 10,
        min_after_dequeue=batch_size * 2, enqueue_many=True)
    return features, features.pop('label')
Model FN: There is other context, but in general this is fed in via:

input_layer = tf.feature_column.input_layer(
    features, feature_columns=get_feature_columns())
I recognize that one approach is to do the splitting and indexing ahead of time, but that is not currently possible due to how I am accessing the .csv's. My issue with the approach above is that I feel like the transformations should all be handled within get_feature_columns(). Is it "best practice" to handle the transformation within the input function before the data is sent to the model, or should I be trying to find a way to perform either the split or the lookup within the model itself?
My worry is that I will now need a separate serving_input_fn() that has to perform the same transformations that appear in the current input_fn().
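For concreteness, the duplication I have in mind would look something like the sketch below (untested; I am assuming a tf.estimator-style export with ServingInputReceiver, and the placeholder name is arbitrary):

def serving_input_fn():
    # Raw, untokenized sentences arrive as strings at serving time.
    text = tf.placeholder(tf.string, shape=[None], name='text')
    # These lines repeat what generate_input_fn() already does, which is
    # exactly the part that could drift out of sync.
    text_index = tf.contrib.lookup.index_table_from_file(vocabulary_file=vocab_file)
    tokens = tf.string_split(text)
    tokens_idx = text_index.lookup(tokens)
    return tf.estimator.export.ServingInputReceiver(
        features={'text': tokens_idx},
        receiver_tensors={'text': text})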
These two code paths could easily get out of sync if a change is made to one and not the other. Are there any other options that could be suggested to avoid this problem?
The division between what happens in the input_fn and what happens in the model_fn is entirely determined by what behavior you want at inference time. As a general rule of thumb: transformations that you also need at inference time belong in the model_fn, while transformations that are specific to a particular input pipeline belong in that pipeline's input_fn (serving or training/eval).

This is only a rule for convenience; it will work either way. But often you want to put as much preprocessing outside the graph as possible for training/eval, so that you don't duplicate the compute time of the preprocessing when you train for multiple epochs or try a new model architecture. However, you then want to put as much of that preprocessing inside the graph as possible for inference, since it will (generally) be more efficient from a latency perspective than proxying.
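Applied to your case, that would mean feeding raw sentence strings through the input functions and doing the split and vocabulary lookup at the top of the model_fn, so that training, eval, and serving all share the same ops. A rough sketch of what I mean (untested; passing the vocabulary via params['vocab_file'] is just one way to plumb it in, and it reuses your get_feature_columns()):

def model_fn(features, labels, mode, params):
    # features['text'] holds the raw sentence strings; tokenization and
    # vocabulary lookup now live inside the graph itself.
    table = tf.contrib.lookup.index_table_from_file(
        vocabulary_file=params['vocab_file'])
    tokens = tf.string_split(features['text'])  # SparseTensor of string tokens
    tokens_idx = table.lookup(tokens)           # SparseTensor of integer ids
    input_layer = tf.feature_column.input_layer(
        {'text': tokens_idx}, feature_columns=get_feature_columns())
    # ... build the rest of the model (layers, loss, train_op) as before ...

With that in place, both the training input_fn and the serving_input_fn only hand over raw strings, so there is no transformation logic left to keep in sync between them.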
Hope this clears things up.