python machine-learning keras neural-network dimensionality-reduction

How to use a learned embedding layer from a Keras ANN as an input feature in an XGBoost model?

I am attempting to reduce the dimensionality of a categorical feature by extracting an embedding layer from a neural net and using it as an input feature in a separate XGBoost model.

An embedding layer has the dimensions (nr. unique categories + 1, chosen output size). How can it be concatenated to the continuous variables in the original training data with the dimensions (nr. observations, nr. features)?

Below is a reproducible example of regression with a neural net, in which a categorical feature is encoded as a learned embedding layer. The example is closely adapted from: http://machinelearningmechanic.com/keras/2018/03/09/keras-regression-with-categorical-variable-embeddings-md.html#Define-the-input-layers

At the end I have printed the embedding layer and its shape. How can this layer be merged with the continuous features in the original training data (X_train_continuous)? If the number of rows were equal to the number of categories and if we knew the order in which categories are represented in the embedding layer, the embedding array could perhaps be joined to the training observations on category, but instead the number of rows equals the number of categories + 1 (in the code: len(values) + 1).

# Imports and helper functions

import numpy as np
import pandas as pd
import keras
from keras.models import Sequential
from keras.layers import BatchNormalization, Dense, Embedding, Input
from keras.models import Model
from keras.callbacks import Callback
import matplotlib.pyplot as plt

# Bayesian Methods for Hackers style sheet
plt.style.use('bmh')

np.random.seed(1234567890)


class PeriodicLogger(Callback):
    """
    A helper callback class that prints the losses once every 'display' epochs
    """

    def __init__(self, display=100):
        # Initialise the base Callback state before adding our own attributes
        super().__init__()
        self.display = display

    def on_train_begin(self, logs=None):
        self.epochs = 0

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.epochs += 1
        if self.epochs % self.display == 0:
            print("Epoch: %d - loss: %f - val_loss: %f" % (
                self.epochs, logs['loss'], logs['val_loss']))


periodic_logger_250 = PeriodicLogger(250)

# Define the mapping and a function that computes the house price for each
# example

per_meter_mapping = {
    'Mercaz': 500,
    'Old North': 350,
    'Florentine': 230
}

per_room_additional_price = {
    'Mercaz': 15. * 10 ** 4,
    'Old North': 8. * 10 ** 4,
    'Florentine': 5. * 10 ** 4
}


def house_price_func(row):
    """
    house_price_func is the function f(a,s,n).

    :param row: dict (contains the keys: ['area', 'size', 'n_rooms'])
    :return: float
    """
    area, size, n_rooms = row['area'], row['size'], row['n_rooms']
    return size * per_meter_mapping[area] + n_rooms * \
           per_room_additional_price[area]

# Create toy data

AREAS = ['Mercaz', 'Old North', 'Florentine']


def create_samples(n_samples):
    """
    Helper method that creates dataset DataFrames

    Note that the np.random.choice call only determines the number of rooms and the size of the house
    (the price, which we calculate later, is deterministic)

    :param n_samples: int (number of samples for each area (suburb))
    :return: pd.DataFrame
    """
    samples = []

    for n_rooms in np.random.choice(range(1, 6), n_samples):
        samples += [(area, int(np.random.normal(25, 5)), n_rooms) for area in
                    AREAS]

    return pd.DataFrame(samples, columns=['area', 'size', 'n_rooms'])

# Create the train and validation sets

train = create_samples(n_samples=1000)
val = create_samples(n_samples=100)

# Calculate the prices for each set

train['price'] = train.apply(house_price_func, axis=1)
val['price'] = val.apply(house_price_func, axis=1)

# Define the features and the y vectors

continuous_cols = ['size', 'n_rooms']
categorical_cols = ['area']
y_col = ['price']

X_train_continuous = train[continuous_cols]
X_train_categorical = train[categorical_cols]
y_train = train[y_col]

X_val_continuous = val[continuous_cols]
X_val_categorical = val[categorical_cols]
y_val = val[y_col]

# Normalization

# Normalize both the train and validation sets to zero mean and unit std.,
# using the train set mean and std.
# This gives each feature an equal initial importance and speeds up
# training

train_mean = X_train_continuous.mean(axis=0)
train_std = X_train_continuous.std(axis=0)

X_train_continuous = X_train_continuous - train_mean
X_train_continuous /= train_std

X_val_continuous = X_val_continuous - train_mean
X_val_continuous /= train_std

# Build a model using a categorical variable
# First let's define a helper class for the categorical variable

class EmbeddingMapping():
    """
    Helper class for handling categorical variables

    An instance of this class should be defined for each categorical variable
    we want to use.
    """

    def __init__(self, series):
        # get a list of unique values
        values = series.unique().tolist()

        # Set a dictionary mapping from values to integer value
        # In our example this will be {'Mercaz': 1, 'Old North': 2,
        # 'Florentine': 3}
        self.embedding_dict = {value: int_value + 1 for int_value, value in
                               enumerate(values)}

        # The num_values will be used as the input_dim when defining the
        # embedding layer: one row per seen value plus row 0, which is
        # reserved for unseen values
        self.num_values = len(values) + 1

    def get_mapping(self, value):
        # If the value was seen in the training set, return its integer mapping
        if value in self.embedding_dict:
            return self.embedding_dict[value]

        # Else, return the same integer for all unseen values. Index 0 is
        # used because valid Embedding indices run from 0 to num_values - 1
        else:
            return 0

# Create an embedding column for the train/validation sets

area_mapping = EmbeddingMapping(X_train_categorical['area'])

X_train_categorical = \
    X_train_categorical.assign(area_mapping=X_train_categorical['area']
                               .apply(area_mapping.get_mapping))
X_val_categorical = \
    X_val_categorical.assign(area_mapping=X_val_categorical['area']
                             .apply(area_mapping.get_mapping))

# Define the input layers

# Define the embedding input
area_input = Input(shape=(1,), dtype='int32')

# Decide to what vector size we want to map our 'area' variable.
# I'll use 2 here because we only have three areas
embeddings_output = 2

# Let’s define the embedding layer and flatten it
area_embedings = Embedding(output_dim=embeddings_output,
                           input_dim=area_mapping.num_values,
                           input_length=1, name="embedding_layer")(area_input)
area_embedings = keras.layers.Reshape((embeddings_output,))(area_embedings)

# Define the continuous variables input (just like before)
continuous_input = Input(shape=(X_train_continuous.shape[1], ))

# Concatenate continuous and embeddings inputs
all_input = keras.layers.concatenate([continuous_input, area_embedings])

# To merge them together we will use Keras Functional API
# Will define a simple model with 2 hidden layers, with 25 neurons each.

# Define the model
units=25
dense1 = Dense(units=units, activation='relu')(all_input)
dense2 = Dense(units, activation='relu')(dense1)
predictions = Dense(1)(dense2)

# Note: pass the input object 'area_input' here, not the 'area_embedings' tensor
model = Model(inputs=[continuous_input, area_input], outputs=predictions)

# Let's train the model

epochs = 100  # to train properly, use 10000
model.compile(loss='mse',
              optimizer=keras.optimizers.Adam(lr=.8, beta_1=0.9,
                                              beta_2=0.999, decay=1e-03,
                                              amsgrad=True))

# Note continuous and categorical columns are passed in the same order as
# the inputs list given to Model above
history = model.fit([X_train_continuous, X_train_categorical['area_mapping']],
                    y_train, epochs=epochs, batch_size=128, callbacks=[
        periodic_logger_250], verbose=0,
                    validation_data=([X_val_continuous, X_val_categorical[
                        'area_mapping']], y_val))

# Observe the embedding layer

embeddings_output = model.get_layer('embedding_layer').get_weights()[0]

print(f'Embedding layer:\n{embeddings_output}')
print(f'Embedding layer shape: {embeddings_output.shape}')

Solution

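First, a note on terminology: an "embedding" is the representation of a particular input sample, i.e. the vector a layer outputs for that sample. The "weights" are the matrices stored and trained inside the layer. What get_weights()[0] returns in the question is the weight matrix of the Embedding layer, with one row per integer category code, not a per-observation embedding.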
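To make the distinction concrete, here is a minimal NumPy sketch. The weight values, integer codes and continuous features below are made up for illustration; in the question's code the weight matrix would come from model.get_layer('embedding_layer').get_weights()[0] and the codes from X_train_categorical['area_mapping']. Indexing the weight matrix with the per-observation codes gives one embedding vector per row, which has the right shape to be stacked next to the continuous features.

```python
import numpy as np

# Hypothetical stand-in for the trained Embedding weight matrix:
# one row per integer code, shape (n_categories + 1, embedding_dim).
weights = np.array([[0.0, 0.0],   # code 0: no training category maps here (mapping starts at 1)
                    [0.1, 0.2],   # code 1
                    [0.3, 0.4],   # code 2
                    [0.5, 0.6]])  # code 3

# Hypothetical integer codes of five observations
codes = np.array([1, 3, 2, 1, 3])

# Hypothetical (already normalized) continuous features, shape (5, 2)
continuous = np.array([[0.5, -1.0],
                       [1.2, 0.3],
                       [-0.7, 0.8],
                       [0.0, 0.0],
                       [2.1, -0.4]])

# Row lookup: each observation gets the weight row of its category code
embeddings = weights[codes]                      # shape (5, 2)

# Column-wise concatenation with the continuous features
features = np.hstack([continuous, embeddings])   # shape (5, 4)

print(embeddings.shape, features.shape)
```

The extra row in the weight matrix is therefore not a problem: rows are selected by code, so a row that no observation refers to simply never appears in the result.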

In Keras, the Model class is a subclass of Layer. You can use any Model as a Layer in a larger model.

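You can create a Model containing just the Embedding layer, then use it as a layer when building the rest of your model. After training the full model, call .predict() on that "sub-model" to get one embedding vector per observation, which can then be concatenated column-wise with the continuous features. You can also save the sub-model's architecture to a JSON file with to_json() and reload it later; note that the JSON holds only the architecture, so the weights must be saved separately (for example with save_weights()).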
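The sub-model approach could look roughly like the sketch below. It is not the asker's exact model: the sizes, the random toy data and the names embedding_model, area_input_full and xgb_features are assumptions chosen for illustration, and a single hidden layer and one epoch are used only to keep it short.

```python
import numpy as np
from keras.layers import Input, Embedding, Reshape, Dense, concatenate
from keras.models import Model

n_categories, embedding_dim, n_continuous = 3, 2, 2  # illustrative sizes

# Sub-model: integer category code -> embedding vector
area_input = Input(shape=(1,), dtype='int32')
embedded = Embedding(input_dim=n_categories + 1, output_dim=embedding_dim,
                     name='embedding_layer')(area_input)
embedded = Reshape((embedding_dim,))(embedded)
embedding_model = Model(inputs=area_input, outputs=embedded)

# Full model: the sub-model is called like any other layer
continuous_input = Input(shape=(n_continuous,))
area_input_full = Input(shape=(1,), dtype='int32')
merged = concatenate([continuous_input, embedding_model(area_input_full)])
hidden = Dense(25, activation='relu')(merged)
predictions = Dense(1)(hidden)
model = Model(inputs=[continuous_input, area_input_full], outputs=predictions)
model.compile(loss='mse', optimizer='adam')

# Random toy data, only to make the sketch runnable
rng = np.random.default_rng(0)
X_cont = rng.normal(size=(30, n_continuous)).astype('float32')
codes = rng.integers(1, n_categories + 1, size=(30, 1)).astype('int32')
y = rng.normal(size=(30, 1)).astype('float32')

# Training the full model also trains the shared embedding weights
model.fit([X_cont, codes], y, epochs=1, verbose=0)

# One embedding vector per observation, then stack with continuous features
embedded_features = embedding_model.predict(codes, verbose=0)
xgb_features = np.hstack([X_cont, embedded_features])

print(embedded_features.shape, xgb_features.shape)
```

The resulting xgb_features array has one row per observation, so it can be passed to a separate model, for example xgboost.XGBRegressor().fit(xgb_features, y). For validation or test data, call embedding_model.predict() on their integer codes in the same way.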

This is the standard technique for creating a model that emits internal embeddings.