I was trying to use a custom activation in a mixed-precision training pipeline, but ran into the following error:
TypeError: Input 'y' of 'Mul' Op has type float32 that does not match type float16 of argument 'x'.
Enabling mixed precision...
import tensorflow as tf
policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')
tf.keras.mixed_precision.experimental.set_policy(policy)
print('Mixed precision enabled')
Custom activation...
def ARelu(x, alpha=0.90, beta=2.0):
    alpha = tf.clip_by_value(alpha, clip_value_min=0.01, clip_value_max=0.99)
    beta = 1 + tf.math.sigmoid(beta)
    return tf.nn.relu(x) * beta - tf.nn.relu(-x) * alpha
Training...
import tensorflow as tf
(xtrain, ytrain), (xtest, ytest) = tf.keras.datasets.mnist.load_data()
def pre_process(inputs, targets):
    inputs = tf.expand_dims(inputs, -1)
    targets = tf.one_hot(targets, depth=10)
    return tf.divide(inputs, 255), targets
train_data = tf.data.Dataset.from_tensor_slices((xtrain, ytrain)).\
    take(10_000).shuffle(10_000).batch(8).map(pre_process)
test_data = tf.data.Dataset.from_tensor_slices((xtest, ytest)).\
    take(1_000).shuffle(1_000).batch(8).map(pre_process)
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3), strides=(1, 1),
                           input_shape=(28, 28, 1), activation=ARelu),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),
    tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1),
                           activation=ARelu),
    tf.keras.layers.MaxPool2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation=ARelu),
    tf.keras.layers.Dense(10, activation='softmax', dtype=tf.float32)])
opt = tf.keras.optimizers.Adam()
model.compile(loss='categorical_crossentropy', optimizer=opt)
history = model.fit(train_data, validation_data=test_data, epochs=10)
# ------------------
TypeError: Input 'y' of 'Mul' Op has type float32 that does not match type float16 of argument 'x'.
However, without mixed precision it works. I understand the problem is simply a type mismatch, but where should I look to fix it?
Additionally, while trying to solve this, I found that tf.keras.mixed_precision.LossScaleOptimizer
can be used to avoid numeric underflow. Is it something we should use for mixed-precision training?
The solution to the above problem is to cast your alpha and beta to float16 inside the activation, rather than casting the input of your activation layer to float32.
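For example, the activation can cast its constants to the input's dtype (float16 under the mixed_float16 policy) before the multiplication; a minimal sketch of the corrected ARelu:

import tensorflow as tf

def ARelu(x, alpha=0.90, beta=2.0):
    alpha = tf.clip_by_value(alpha, clip_value_min=0.01, clip_value_max=0.99)
    beta = 1 + tf.math.sigmoid(beta)
    # alpha and beta are float32 tensors here; cast them to the input's
    # compute dtype so the Mul ops see matching types.
    alpha = tf.cast(alpha, x.dtype)
    beta = tf.cast(beta, x.dtype)
    return tf.nn.relu(x) * beta - tf.nn.relu(-x) * alpha

Casting to x.dtype rather than hard-coding float16 also keeps the activation working when the global policy is plain float32.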
DETAILS:
In reality, the main reason for using mixed precision is to reduce the memory footprint during training. It does so by storing each layer's output in FP16, since memory consumption is dominated by the storage of activations rather than weights. By recasting your layer output to FP32 inside the custom activation function, you lose these savings and may even need more memory than full-precision training, because two copies of each activation then exist.
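If you want to verify this on your own model, you can inspect a layer's dtype policy; a quick sketch using the non-experimental mixed-precision API available in newer TF releases:

import tensorflow as tf

tf.keras.mixed_precision.set_global_policy('mixed_float16')
layer = tf.keras.layers.Dense(64)
# The layer computes and outputs in float16 but keeps its weights in
# float32, which is why your activation receives a float16 input.
print(layer.compute_dtype)   # float16
print(layer.variable_dtype)  # float32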