I am struggling with how to adapt the ML Engine examples to use a long text string as an input feature. I am building a custom estimator (like this example) and am looking to understand "best practices" with an eye towards deploying the model: I would like as many transformations as possible to be contained within the model_fn itself so that things are easier at serving time.

Consider a .csv with a column "text" that contains sentences such as "This is a sentence". Ultimately I need to split this text into tokens using tf.string_split(), convert the individual tokens to indices (using a vocabulary file or similar), and then pass those indices to an embedding. Assuming "This is a sentence" is the sentence we are working with, one approach is outlined below. My question is whether this is the "optimal" way to accomplish this, or if there is a better way.
Feature Columns
def get_feature_columns():
    sparse = tf.contrib.layers.sparse_column_with_integerized_feature(
        column_name='text',
        bucket_size=100000,
        combiner="sum")
    embedding = tf.contrib.layers.embedding_column(
        sparse, dimension=word_embedding_size)
    return set([embedding])
Input Function
def generate_input_fn():
    # Read rows from the .csv files.
    filename_queue = tf.train.string_input_producer(
        filenames, num_epochs=num_epochs, shuffle=shuffle, capacity=32)
    reader = tf.TextLineReader(skip_header_lines=skip_header_lines)
    _, rows = reader.read_up_to(filename_queue, num_records=batch_size)
    text, label = tf.decode_csv(rows, record_defaults=[[""], [""]])

    # Transform text from sentence --> tokens --> integerized tokens.
    text_index = tf.contrib.lookup.index_table_from_file(vocabulary_file=vocab_file)
    tokens = tf.string_split(text)
    tokens_idx = text_index.lookup(tokens)

    features = dict(zip(['text', 'label'], [tokens_idx, label]))
    # enqueue_many=True because read_up_to already yields a batch of rows;
    # capacity and min_after_dequeue are required by shuffle_batch.
    features = tf.train.shuffle_batch(
        features, batch_size, capacity=batch_size * 10,
        min_after_dequeue=batch_size * 2, enqueue_many=True)
    return features, features.pop('label')
Model FN: There is other context, but in general this is fed in via:

input_layer = tf.feature_column.input_layer(
    features, feature_columns=get_feature_columns())
I recognize that one approach is to do the splitting and indexing ahead of time, but that is not currently possible due to how I am accessing the .csv's. My issue with the approach above is that I feel like the transformations should all be handled within get_feature_columns(). Is it "best practice" to handle the transformation within the input function before the data is sent to the model, or should I be trying to find a way to perform either the split or the lookup within the model itself?
My worry is that I will now need a separate serving_input_fn() that has to perform the same transformations that appear in the current input_fn().
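For concreteness, the duplication I have in mind would look something like the sketch below (untested; I am assuming a tf.estimator-style export with ServingInputReceiver, and the placeholder name is arbitrary):

def serving_input_fn():
    # Raw, untokenized sentences arrive as strings at serving time.
    text = tf.placeholder(tf.string, shape=[None], name='text')
    # These lines repeat what generate_input_fn() already does, which is
    # exactly the part that could drift out of sync.
    text_index = tf.contrib.lookup.index_table_from_file(vocabulary_file=vocab_file)
    tokens = tf.string_split(text)
    tokens_idx = text_index.lookup(tokens)
    return tf.estimator.export.ServingInputReceiver(
        features={'text': tokens_idx},
        receiver_tensors={'text': text})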
These two code paths could easily get out of sync if a change is made to one and not the other. Are there any other options that could be suggested to avoid this problem?
The division between what happens in the input_fn and what happens in the model_fn is entirely determined by what behavior you want at inference time. As a general rule of thumb: transformations that you also need at inference time belong in the model_fn, while transformations that are specific to a particular input pipeline belong in that pipeline's input_fn (serving or training/eval).

This is only a rule for convenience; it will work either way. But often you want to put as much preprocessing outside the graph as possible for training/eval, so that you don't duplicate the compute time of the preprocessing when you train for multiple epochs or try a new model architecture. However, you then want to put as much of that preprocessing inside the graph as possible for inference, since it will (generally) be more efficient from a latency perspective than proxying.
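Applied to your case, that would mean feeding raw sentence strings through the input functions and doing the split and vocabulary lookup at the top of the model_fn, so that training, eval, and serving all share the same ops. A rough sketch of what I mean (untested; passing the vocabulary via params['vocab_file'] is just one way to plumb it in, and it reuses your get_feature_columns()):

def model_fn(features, labels, mode, params):
    # features['text'] holds the raw sentence strings; tokenization and
    # vocabulary lookup now live inside the graph itself.
    table = tf.contrib.lookup.index_table_from_file(
        vocabulary_file=params['vocab_file'])
    tokens = tf.string_split(features['text'])  # SparseTensor of string tokens
    tokens_idx = table.lookup(tokens)           # SparseTensor of integer ids
    input_layer = tf.feature_column.input_layer(
        {'text': tokens_idx}, feature_columns=get_feature_columns())
    # ... build the rest of the model (layers, loss, train_op) as before ...

With that in place, both the training input_fn and the serving_input_fn only hand over raw strings, so there is no transformation logic left to keep in sync between them.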
Hope this clears things up.