Tags: python, tensorflow, keras, tensorflow-datasets

How to apply Keras Normalization to a ParallelMapDataset without making it eager?


I am training a TensorFlow Keras CNN over images, with too much training data to fit into memory. I've got a tf.data.Dataset preprocessing pipeline that reads the images from HDF5 files in a dataset.map() step. Now I'm trying to normalize the numeric image data to zero mean and unit variance.

I'm following the example from the guide, except that I have that .map() step in there:

import tensorflow as tf
import tensorflow_io as tfio

def load_features_from_hdf5(filename):
    spec = tf.TensorSpec(feature_shape, dtype=tf.dtypes.float32, name=None)
    dataset = tfio.IODataset.from_hdf5(filename, "/features", spec=spec)  # returns a Dataset
    feature = dataset.get_single_element()  # the one feature tensor in the file
    feature.set_shape(feature_shape)        # feature_shape is defined elsewhere, e.g. (41, 682, 1)
    return feature

train_x = tf.data.Dataset.from_tensor_slices(filenames).map(load_features_from_hdf5, num_parallel_calls=tf.data.AUTOTUNE)

normalizer = tf.keras.layers.Normalization(axis=None)
normalizer.adapt(train_x.take(1000))

train_x_normalized = normalizer(train_x)  #  <-- ValueError

adapt() successfully computes the mean and variance from the dataset. But when I try to actually apply the normalization to the exact same dataset, it errors while trying to convert my ParallelMapDataset to an EagerTensor.

ValueError: Attempt to convert a value (<ParallelMapDataset shapes: (41, 682, 1), types: tf.float32>) with an unsupported type (<class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>) to a Tensor.

How can I get this working? Since the data is so large, I don't think I want to make anything eager until training starts. Should I make the normalization an explicit pipeline step on the Dataset, or an explicit layer on the model itself? (In the latter case, how can I bring the mean and variance values from training time to inference time in another process?)


Solution

  • You could try something like this:

    import tensorflow as tf
    
    # Dummy data: 100 28x28x3 images with scalar labels, in batches of 10
    train_x = tf.data.Dataset.from_tensor_slices((tf.random.normal((100, 28, 28, 3)), tf.random.normal((100, 1)))).batch(10)
    
    normalizer = tf.keras.layers.Normalization(axis=None)
    
    # Adapt on the images only, stripping the labels
    normalizer.adapt(train_x.map(lambda x, y: x))
    
    # Apply normalization to the images, leaving the labels untouched
    train_x_normalized = train_x.map(lambda x, y: (normalizer(x), y))
    

    Example:

    for x, y in train_x_normalized.take(1):
      print(tf.reduce_mean(x), tf.math.reduce_variance(x))
    
    tf.Tensor(0.00930768, shape=(), dtype=float32) tf.Tensor(1.0023469, shape=(), dtype=float32)
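
    Note that the train_x in your question yields bare feature tensors (no labels), so the map is even simpler there. A minimal sketch, assuming your HDF5-backed train_x from the question:

    normalizer = tf.keras.layers.Normalization(axis=None)
    normalizer.adapt(train_x.take(1000))  # adapt on a sample of the features
    train_x_normalized = train_x.map(normalizer, num_parallel_calls=tf.data.AUTOTUNE)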
    

    Or, as you mentioned in your question, you can use the normalization layer as part of your model; a sketch of that follows below.
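
    The adapted mean and variance live as non-trainable weights on the Normalization layer, so when the layer is part of the model they are saved and restored along with it; that is how the statistics get from the training process to an inference process. A minimal sketch, again assuming your train_x of bare feature tensors (the Conv2D/Dense layers and the "my_model" path are placeholders, not from your question):

    import tensorflow as tf
    
    feature_shape = (41, 682, 1)  # shape from the error message in your question
    
    normalizer = tf.keras.layers.Normalization(axis=None)
    normalizer.adapt(train_x.take(1000))  # adapt on the raw feature dataset
    
    model = tf.keras.Sequential([
        tf.keras.Input(shape=feature_shape),
        normalizer,  # normalization is now part of the model graph
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1),
    ])
    
    # Training process: the adapted statistics are serialized with the model
    model.save("my_model")
    
    # Inference process: the statistics come back with the model
    restored = tf.keras.models.load_model("my_model")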