Tags: tensorflow, tensorflow-serving, google-cloud-ml-engine

Where to perform tf.string_split() operation


I am struggling with how to adapt the ML Engine examples to use a long text string as an input feature. I am building a custom estimator (like this example) and am looking to understand "best practices" with an eye towards deploying the model: I would like as many transformations as possible to be contained within the model_fn itself so that things are easier at serving time.

Consider a .csv with a column "text" that contains sentences such as "This is a sentence". Ultimately I need to split this text into tokens using tf.string_split(), convert the individual tokens to indices (using a vocabulary file or similar), and then pass these to an embedding. Assuming "This is a sentence" is the sentence we are working with, one approach is outlined below. My question is whether this is the "optimal" way to accomplish this, or if there is a better way.

Feature Columns

def get_feature_columns():
    # Treat the integerized token ids in 'text' as a sparse categorical
    # feature; ids are assumed to fall in [0, bucket_size).
    sparse = tf.contrib.layers.sparse_column_with_integerized_feature(
        column_name='text',
        bucket_size=100000,
        combiner='sum')
    # Map the sparse token ids into a dense learned embedding.
    embedding = tf.contrib.layers.embedding_column(
        sparse, dimension=word_embedding_size)

    return set([embedding])

Input Function

def generate_input_fn():
    # Queue up the input files and read batches of raw csv rows.
    filename_queue = tf.train.string_input_producer(
        filenames, num_epochs=num_epochs, shuffle=shuffle, capacity=32)

    reader = tf.TextLineReader(skip_header_lines=skip_header_lines)

    _, rows = reader.read_up_to(filename_queue, num_records=batch_size)

    text, label = tf.decode_csv(rows, record_defaults=[[""], [""]])

    # transform text from sentence --> tokens --> integerized tokens
    text_index = tf.contrib.lookup.index_table_from_file(
        vocabulary_file=vocab_file)
    tokens = tf.string_split(text)          # SparseTensor of string tokens
    tokens_idx = text_index.lookup(tokens)  # SparseTensor of vocab indices

    features = {'text': tokens_idx, 'label': label}

    # shuffle_batch requires capacity and min_after_dequeue (values here are
    # illustrative); enqueue_many=True because read_up_to already produced a
    # batch of rows.
    features = tf.train.shuffle_batch(
        features, batch_size, capacity=1000, min_after_dequeue=100,
        enqueue_many=True)

    return features, features.pop('label')

Model FN: There is other context, but in general this is fed in via:

input_layer = tf.feature_column.input_layer(features,
    feature_columns=get_feature_columns())

I recognize that one approach is to do the splitting and indexing ahead of time, but that is not currently possible due to how I am accessing the .csv files. My issue with this approach is that I feel like the transformations should all be handled within get_feature_columns(). Is it "best practice" to handle the transformation within the input function before the data is sent to the model, or should I be trying to find a way to perform either the split or the lookup within the model itself?

My worry is that I will now need a separate serving_input_fn() that performs the same transformations as the current input_fn(), but the two could easily get out of sync if a change is made to one and not the other. Are there any other options that could be suggested to avoid this problem?


Solution

  • The division between what happens in the input_fn and what happens in the model_fn is entirely determined by what behavior you want at inference time. As a general rule of thumb:

    • If you need to perform a transformation on both the training input and the prediction input, put it in the model_fn
    • If you need to perform a transformation only on the training input or only on the prediction input, put it in the corresponding input_fn (serving or training/eval)

    This is only a rule of convenience; it will work either way. But you often want to put as much preprocessing outside the graph as possible for training/eval, so that you don't duplicate the compute cost of the preprocessing when you train for multiple epochs or try a new model architecture. At inference time, however, you want to put as much of that preprocessing inside the graph as possible, since it will (generally) be more efficient from a latency perspective than proxying. One way to keep a shared transformation from drifting between the training input_fn and the serving_input_fn is sketched below.
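    For example, the sentence --> token-id transform from the question could be factored into a single helper that both input functions call, so a change to one cannot silently miss the other. A minimal sketch, assuming the TF 1.x contrib APIs from the question; the helper name and placeholder shape are illustrative, and vocab_file is the module-level path from the question:

        import tensorflow as tf

        # Hypothetical shared helper: the sentence --> integerized-token
        # transform from the question, factored out so the training
        # input_fn and the serving_input_fn stay in sync.
        def sentences_to_token_ids(text, vocab_file):
            table = tf.contrib.lookup.index_table_from_file(
                vocabulary_file=vocab_file)
            tokens = tf.string_split(text)   # SparseTensor of string tokens
            return table.lookup(tokens)      # SparseTensor of int64 vocab ids

        def serving_input_fn():
            # Raw sentences arrive as a batch of strings at prediction time.
            text = tf.placeholder(tf.string, shape=[None], name='text')
            # vocab_file: module-level vocabulary path, as in the question.
            features = {'text': sentences_to_token_ids(text, vocab_file)}
            return tf.estimator.export.ServingInputReceiver(
                features, receiver_tensors={'text': text})

    The training input_fn would call the same helper in place of its inline split/lookup, and the model would be exported with something like estimator.export_savedmodel(export_dir, serving_input_fn). Alternatively, calling the helper from the model_fn pushes the transform inside the graph for both training and serving, per the rule of thumb above.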

    Hope this clears things up.