tensorflow, google-cloud-ml, tensorflow-datasets, tensorflow-estimator

Getting free text features into TensorFlow canned estimators with the Dataset API via feature_columns


I'm trying to build a model that gives reddit_score = f('subreddit','comment')

Mainly this is an example I can then build on for a work project.

My code is here.

My problem is that I see that canned estimators, e.g. DNNLinearCombinedRegressor, must have feature_columns that are part of the FeatureColumn class.

I have my vocab file and know that if I were to just limit to the first word of a comment I could do something like

tf.feature_column.categorical_column_with_vocabulary_file(
        key='comment',
        vocabulary_file='{}/vocab.csv'.format(INPUT_DIR)
        )

But if I'm passing in, say, the first 10 words of a comment, then I'm not sure how to go from a string like "this is a pre padded 10 word comment xyzpadxyz xyzpadxyz" to a feature_column such that I can then build an embedding to pass to the deep features in a wide and deep model.

It seems like it must be something really obvious or simple, but I can't for the life of me find any existing examples with this particular setup (canned wide and deep, Dataset API, and a mix of features, e.g. subreddit and a raw text feature like comment).

I was even thinking about doing the vocab integer lookup myself, such that the comment feature I pass in would be something like [23,45,67,12,1,345,7,99,999,999], and then maybe I could get it in via numeric_column with a shape and do something with it from there. But this feels a bit odd.
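
A minimal sketch of that alternative, assuming the comment had already been encoded as a fixed-length list of integer ids (the key name and vocab size here are hypothetical), would presumably treat the ids as categorical rather than numeric:

import tensorflow as tf

VOCAB_SIZE = 1000        # hypothetical vocabulary size
EMBEDDING_SIZE = 10

# Pre-encoded word ids are categorical, not continuous, so they would go through
# categorical_column_with_identity and then an embedding, rather than a numeric column
comment_ids = tf.feature_column.categorical_column_with_identity(
        key='comment_ids',
        num_buckets=VOCAB_SIZE
        )
comment_embeds = tf.feature_column.embedding_column(comment_ids, dimension=EMBEDDING_SIZE)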


Solution

  • Adding an answer based on the approach from the post @Lak did, but adapted a little for the Dataset API.

    # Create an input function reading a file using the Dataset API
    # Then provide the results to the Estimator API
    def read_dataset(prefix, mode, batch_size):
    
        def _input_fn():
    
            def decode_csv(value_column):
    
                columns = tf.decode_csv(value_column, field_delim='|', record_defaults=DEFAULTS)
                features = dict(zip(CSV_COLUMNS, columns))
    
                # Split the raw comment string into its words
                words = tf.string_split([features['comment']])
                words = tf.sparse_tensor_to_dense(words, default_value=PADWORD)

                # Pad out by MAX_DOCUMENT_LENGTH, then slice back, so every comment ends up
                # exactly MAX_DOCUMENT_LENGTH words wide (short comments padded, long ones truncated)
                padding = tf.constant([[0, 0], [0, MAX_DOCUMENT_LENGTH]])
                padded = tf.pad(words, padding)
                features['comment_words'] = tf.slice(padded, [0, 0], [-1, MAX_DOCUMENT_LENGTH])
    
                label = features.pop(LABEL_COLUMN)
    
                return features, label
    
            # Use prefix to create file path
            file_path = '{}/{}*{}*'.format(INPUT_DIR, prefix, PATTERN)
    
            # Create list of files that match pattern
            file_list = tf.gfile.Glob(file_path)
    
            # Create dataset from file list
            dataset = (tf.data.TextLineDataset(file_list)  # Read text file
                        .map(decode_csv))  # Transform each elem by applying decode_csv fn
    
            tf.logging.info("...dataset.output_types={}".format(dataset.output_types))
            tf.logging.info("...dataset.output_shapes={}".format(dataset.output_shapes))
    
            if mode == tf.estimator.ModeKeys.TRAIN:
    
                num_epochs = None # indefinitely
                dataset = dataset.shuffle(buffer_size = 10 * batch_size)
    
            else:
    
                num_epochs = 1 # end-of-input after this
    
            dataset = dataset.repeat(num_epochs).batch(batch_size)
    
            return dataset.make_one_shot_iterator().get_next()
    
        return _input_fn
    

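    As a quick sanity check (assuming files matching the pattern exist under INPUT_DIR), the input function can be evaluated directly in a session; the last dimension of comment_words should come back as MAX_DOCUMENT_LENGTH:

    # Build the eval input function and pull one small batch through it
    features, label = read_dataset('eval', tf.estimator.ModeKeys.EVAL, batch_size=2)()

    with tf.Session() as sess:
        f, l = sess.run([features, label])
        print(f['comment_words'].shape)  # last dimension should be MAX_DOCUMENT_LENGTH
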
    Then in the function below we can reference the field we created as part of decode_csv():

    # Define feature columns
    def get_wide_deep():
    
        EMBEDDING_SIZE = 10
    
        # Define column types
        subreddit = tf.feature_column.categorical_column_with_vocabulary_list('subreddit', ['news', 'ireland', 'pics'])
    
        comment_embeds = tf.feature_column.embedding_column(
            categorical_column = tf.feature_column.categorical_column_with_vocabulary_file(
                key='comment_words',
                vocabulary_file='{}/vocab.csv-00000-of-00001'.format(INPUT_DIR),
                vocabulary_size=100
                ),
            dimension = EMBEDDING_SIZE
            )
    
        # Sparse columns are wide, have a linear relationship with the output
        wide = [ subreddit ]
    
        # Embedding columns are deep, have a complex relationship with the output
        deep = [ comment_embeds ]
    
        return wide, deep
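
    For completeness, a minimal sketch of how these pieces would plug into the canned estimator (the hidden-unit sizes, batch size and step count below are just placeholders):

    def train_and_evaluate(output_dir):

        wide, deep = get_wide_deep()

        estimator = tf.estimator.DNNLinearCombinedRegressor(
            model_dir = output_dir,
            linear_feature_columns = wide,   # sparse columns go to the linear part
            dnn_feature_columns = deep,      # embedding columns go to the DNN part
            dnn_hidden_units = [64, 16])     # placeholder layer sizes

        train_spec = tf.estimator.TrainSpec(
            input_fn = read_dataset('train', tf.estimator.ModeKeys.TRAIN, batch_size = 128),
            max_steps = 1000)

        eval_spec = tf.estimator.EvalSpec(
            input_fn = read_dataset('eval', tf.estimator.ModeKeys.EVAL, batch_size = 128))

        tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)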