Tags: python, tensorflow, keras, tensorflow-datasets

How to apply Keras Normalization to a ParallelMapDataset without making it eager?


I am training a TensorFlow Keras CNN over images, with too much training data to fit into memory. I've got a tf.data.Dataset preprocessing pipeline that reads the images from HDF5 files in a dataset.map() step. Now I'm trying to normalize the numeric image data to zero mean and unit variance.

I'm following the example from the guide, except that I have that .map() step in there:

import tensorflow as tf
import tensorflow_io as tfio

def load_features_from_hdf5(filename):
    spec = tf.TensorSpec(feature_shape, dtype=tf.dtypes.float32, name=None)
    dataset = tfio.IODataset.from_hdf5(filename, "/features", spec=spec)  # returns a Dataset
    feature = dataset.get_single_element()  # the one feature tensor in the file
    feature.set_shape(feature_shape)        # feature_shape is defined elsewhere, e.g. (41, 682, 1)
    return feature

train_x = tf.data.Dataset.from_tensor_slices(filenames).map(load_features_from_hdf5, num_parallel_calls=tf.data.AUTOTUNE)

normalizer = tf.keras.layers.Normalization(axis=None)
normalizer.adapt(train_x.take(1000))

train_x_normalized = normalizer(train_x)  #  <-- ValueError

adapt() successfully computes the mean and variance from the dataset. But when I try to actually apply the normalization to the exact same dataset, it errors while trying to convert my ParallelMapDataset to an EagerTensor.

ValueError: Attempt to convert a value (<ParallelMapDataset shapes: (41, 682, 1), types: tf.float32>) with an unsupported type (<class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>) to a Tensor.

How can I get this working? Since the data is so large, I don't think I want to make anything eager until training starts. Should I make the normalization an explicit pipeline step on the Dataset, or an explicit layer on the model itself? (In the latter case, how can I bring the mean and variance values from training time to inference time in another process?)


Solution

  • You could try something like this:

    import tensorflow as tf
    
    # Dummy data: 100 28x28x3 images with scalar labels, in batches of 10
    train_x = tf.data.Dataset.from_tensor_slices((tf.random.normal((100, 28, 28, 3)), tf.random.normal((100, 1)))).batch(10)
    
    normalizer = tf.keras.layers.Normalization(axis=None)
    
    # Adapt on the images only, stripping the labels
    normalizer.adapt(train_x.map(lambda x, y: x))
    
    # Apply normalization to the images, leaving the labels untouched
    train_x_normalized = train_x.map(lambda x, y: (normalizer(x), y))
    

    Example:

    for x, y in train_x_normalized.take(1):
      print(tf.reduce_mean(x), tf.math.reduce_variance(x))
    
    tf.Tensor(0.00930768, shape=(), dtype=float32) tf.Tensor(1.0023469, shape=(), dtype=float32)
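
    Note that the train_x in your question yields bare feature tensors (no labels), so the map is even simpler there. A minimal sketch, assuming your HDF5-backed train_x from the question:

    normalizer = tf.keras.layers.Normalization(axis=None)
    normalizer.adapt(train_x.take(1000))  # adapt on a sample of the features
    train_x_normalized = train_x.map(normalizer, num_parallel_calls=tf.data.AUTOTUNE)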
    

    Or, as you mentioned in your question, you can use the normalization layer as part of your model; a sketch of that follows below.
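
    The adapted mean and variance live as non-trainable weights on the Normalization layer, so when the layer is part of the model they are saved and restored along with it; that is how the statistics get from the training process to an inference process. A minimal sketch, again assuming your train_x of bare feature tensors (the Conv2D/Dense layers and the "my_model" path are placeholders, not from your question):

    import tensorflow as tf
    
    feature_shape = (41, 682, 1)  # shape from the error message in your question
    
    normalizer = tf.keras.layers.Normalization(axis=None)
    normalizer.adapt(train_x.take(1000))  # adapt on the raw feature dataset
    
    model = tf.keras.Sequential([
        tf.keras.Input(shape=feature_shape),
        normalizer,  # normalization is now part of the model graph
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1),
    ])
    
    # Training process: the adapted statistics are serialized with the model
    model.save("my_model")
    
    # Inference process: the statistics come back with the model
    restored = tf.keras.models.load_model("my_model")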