python tensorflow machine-learning deep-learning tensorboard

How to detect vanishing and exploding gradients with TensorBoard?

I have two sub-questions:

1) How can I detect vanishing or exploding gradients with TensorBoard, given that write_grads=True is currently deprecated in the TensorBoard callback, as per "un-deprecate write_grads for fit #31173"?

2) I figured I can probably tell whether my model suffers from vanishing gradients from the weights' distributions and histograms in the Distributions and Histograms tabs in TensorBoard. My problem is that I have no frame of reference to compare with. Currently, my biases seem to be "moving", but I can't tell whether my kernel weights (Conv2D layers) are "moving"/"changing" "enough". Can someone give me a rule of thumb to assess this visually in TensorBoard? E.g., if only the bottom 25th percentile of kernel weights is moving, is that good enough or not? Or perhaps someone can post two reference images from TensorBoard of vanishing gradients vs. non-vanishing gradients.

Here are my histograms and distributions. Is it possible to tell whether my model suffers from vanishing gradients? (Some layers are omitted for brevity.) Thanks in advance.

Solution

I am currently facing the same question and approached the problem similarly, using TensorBoard.

Even though write_grads is deprecated, you can still log gradients for each layer of your network by subclassing tf.keras.Model and computing the gradients manually with tf.GradientTape in the train_step method.

Something similar to this is working for me:

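import tensorflow as tf
from tensorflow.keras import Model

# Writer used inside train_step; the log directory "logs/train" is an assumed example path
train_summary_writer = tf.summary.create_file_writer("logs/train")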

class TrainWithCustomLogsModel(Model):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Global step counter used as the x-axis for the summaries
        self.step = tf.Variable(0, dtype=tf.int64, trainable=False)

    def train_step(self, data):

        # Get batch images and labels
        x, y = data
        
        # Compute the batch loss (compiled_loss is part of the TF 2.x Keras API)
        with tf.GradientTape() as tape:
            p = self(x, training=True)
            loss = self.compiled_loss(y, p, regularization_losses=self.losses)
        
        # Compute gradients for each weight of the network.
        # Note: trainable_vars and gradients are lists of tensors.
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)

        # Log weights and gradients in TensorBoard
        self.step.assign_add(tf.constant(1, dtype=tf.int64))
        with train_summary_writer.as_default():
            for var, grad in zip(trainable_vars, gradients):
                if grad is None:
                    # Variable did not contribute to the loss; nothing to log
                    continue
                name = var.name
                tf.summary.histogram(name, var, step=self.step)
                tf.summary.histogram('Gradients_' + name, grad, step=self.step)
    
        # Update model's weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        del tape
        # Update metrics (includes the metric that tracks the loss)
        self.compiled_metrics.update_state(y, p)
        # Return a dict mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}

You should then be able to visualize the distributions of your gradients at any training step, along with the distributions of your kernels' values.

Moreover, it might be worth plotting the norm of each layer's gradients over time instead of the individual gradient values.
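
A minimal sketch of that idea, assuming TensorFlow 2.x: write one scalar per variable (the L2 norm of its gradient), so that each layer shows up as a single curve in the Scalars tab. The helper name log_gradient_norms and the "grad_norm/" tag prefix are my own choices for illustration, not part of TensorFlow, and the temporary log directory is just a placeholder.

```python
import tempfile

import tensorflow as tf


def log_gradient_norms(variables, gradients, step, writer):
    """Write the L2 norm of each gradient as a scalar summary.

    Returns a dict mapping tag name -> norm tensor (hypothetical helper).
    """
    norms = {}
    with writer.as_default():
        for var, grad in zip(variables, gradients):
            if grad is None:
                # Variable did not contribute to the loss; skip it
                continue
            # tf.norm with default arguments is the L2 norm over all elements
            norm = tf.norm(grad)
            tag = "grad_norm/" + var.name.replace(":", "_")
            tf.summary.scalar(tag, norm, step=step)
            norms[tag] = norm
    return norms


# Tiny eager-mode demo: loss = sum(w * x), so d(loss)/dw = x
writer = tf.summary.create_file_writer(tempfile.mkdtemp())  # placeholder log dir
w = tf.Variable([1.0, 2.0], name="w")
x = tf.constant([3.0, 4.0])

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(w * x)

grads = tape.gradient(loss, [w])
norms = log_gradient_norms([w], grads, step=0, writer=writer)
```

Inside train_step, the same helper could be called with trainable_vars, gradients, self.step and train_summary_writer. As a rough reading aid rather than a hard rule: curves that decay toward zero in the early layers while later layers stay healthy would point toward vanishing gradients, and curves that spike or keep growing would point toward exploding gradients.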