Other than computing it numerically, is there a quick way to get the derivative of the covariance matrix (of my network activations)?
I'm trying to use it as a penalty term in my cost function in a deep neural network, but in order to back-propagate the error through my layers I need its derivative.
In MATLAB, if 'a' is the activation matrix (neurons x samples) of layer i and 'da' is the derivative of the activation function:
covariance = a * a' / (size(a,2)-1);
So far I've tried:
covarDelta = (da*a' + a*da' ) / (size(a,2)-1);
But strangely, I got much closer to the numerically calculated gradient when I took the derivative as if a*a' were actually a.^2 (which doesn't make sense, but it improved things a bit):
covarDelta = 2*a/size(a,1);
But neither of them is correct. Any idea how else to approximate the derivative of the covariance?
EDIT: I don't use the covariance matrix itself as the penalty term; I take the mean of all its elements and add that number to the cost function. I chose this approach because I wanted a penalty term that grows when there is more covariance overall between the signals.
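To make the setup concrete, here is a minimal NumPy sketch (not from the original post; it just mirrors the MATLAB lines above, assuming the same neurons x samples layout) of the penalty plus a finite-difference checker that any candidate analytic gradient can be compared against:

```python
import numpy as np

def covariance_penalty(a):
    # Mirror of the MATLAB code: covariance = a * a' / (size(a,2)-1),
    # and the penalty is the mean of all elements of that matrix.
    n = a.shape[1]
    c = a @ a.T / (n - 1)
    return c.mean()

def numerical_gradient(f, a, eps=1e-6):
    # Central finite differences, one input element at a time.
    g = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        a_plus, a_minus = a.copy(), a.copy()
        a_plus[idx] += eps
        a_minus[idx] -= eps
        g[idx] = (f(a_plus) - f(a_minus)) / (2 * eps)
    return g
```

Comparing `numerical_gradient(covariance_penalty, a)` against a proposed `covarDelta` is a quick way to rule candidates in or out.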
NOTE: I aim to minimize the similarity between the signals during training (I also tried penalizing pair-wise mutual information, but couldn't find a way to calculate its derivative either).
EDIT 2: I've finally used the same derivative provided by the accepted answer, but I changed the cost term to mean(sqrt(x.^2)), i.e. mean(abs(x)). This way both negative and positive covariance increase the penalty, and the derivative keeps the same form, just multiplied by the sign of each covariance entry.
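A sketch of this absolute-value variant (again NumPy rather than MATLAB, and the gradient formula is my own derivation via the chain rule, not from the original post) looks like this; note that mean(abs(C)) is not differentiable where an entry of C is exactly zero:

```python
import numpy as np

def abs_cov_penalty(a):
    # mean(sqrt(C.^2)) == mean(abs(C)): covariance of either sign is penalized.
    n = a.shape[1]
    c = a @ a.T / (n - 1)
    return np.abs(c).mean()

def abs_cov_penalty_grad(a):
    # Chain rule: dJ/dC = sign(C) / m^2, and since C = a*a'/(n-1) is symmetric,
    # the two product-rule terms combine into 2 * sign(C) @ a / (m^2 * (n-1)).
    # Valid wherever no entry of C is exactly zero (|x| has a kink at 0).
    m, n = a.shape
    c = a @ a.T / (n - 1)
    return 2.0 * np.sign(c) @ a / (m**2 * (n - 1))
```

Away from the kink, this matches a finite-difference check of `abs_cov_penalty`.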
Edit:
Suppose we only have one data point with three dimensions, a = [a1 a2 a3]'. Because the sum of all the elements of the outer product matrix a*a' is exactly the expansion of (a1+a2+a3)^2, the mean of the matrix is (a1+a2+a3)^2/(3*3). So in this case the derivative with respect to every dimension has the same value, 2*(a1+a2+a3)/(3*3).
For more data points that term becomes ((a1+a2+a3)^2 + (b1+b2+b3)^2 + ...)/(3*3), and the derivative for data point x = [x1 x2 x3]' is 2*(x1+x2+x3)/(3*3) (the same value for each dimension). The 1/(size(a,2)-1) factor in your covariance formula just scales all of this by a constant.
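The derivation above can be checked with a short NumPy sketch (my own, including the 1/(size(a,2)-1) factor from the question's covariance formula): the gradient of the mean of the covariance matrix is the same value, 2*(column sum)/(m^2*(n-1)), repeated down every row.

```python
import numpy as np

def mean_cov_grad(a):
    # Closed-form gradient of mean(a @ a.T / (n-1)) w.r.t. a:
    # every dimension of data point k gets the same value,
    # 2 * (sum of column k) / (m^2 * (n-1)).
    m, n = a.shape
    col_sums = a.sum(axis=0)  # x1 + x2 + ... + xm for each data point
    return np.tile(2.0 * col_sums / (m**2 * (n - 1)), (m, 1))
```

The result has the same shape as `a`, so it can be added directly into back-propagation as `covarDelta`.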
Simply taking the mean might not suit your needs, because the positive and negative values in the covariance matrix will cancel out.
Currently I don't have an environment to verify my answer, so please correct me where I'm wrong.
Original Post:
Normally people would use a scalar value as the cost, instead of a (covariance) matrix.
If we denote the covariance as a function cov(x), it takes a matrix as input and outputs a matrix. So the exact derivative is not a single matrix: the partial derivative of the output with respect to each element of the input matrix is itself a matrix. Say the dimension of the input matrix A is m*n; the dimension of the output matrix C is then m*m, and the derivative dC/dA is an m*m*m*n array. See http://mplab.ucsd.edu/tutorials/MatrixRecipes.pdf for details of matrix-by-matrix differentiation.
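As a concrete illustration (a NumPy sketch of my own, not from the tutorial linked above), the full Jacobian can be built numerically and its shape inspected, confirming that one m x m matrix of partials exists per input element:

```python
import numpy as np

def cov_jacobian(a, eps=1e-6):
    # Numerically build the Jacobian of C(a) = a @ a.T / (n-1) w.r.t. a:
    # jac[:, :, p, q] is the m x m matrix dC / da[p, q], so the whole
    # object has shape m x m x m x n.
    m, n = a.shape
    jac = np.zeros((m, m, m, n))
    for p in range(m):
        for q in range(n):
            ap, am = a.copy(), a.copy()
            ap[p, q] += eps
            am[p, q] -= eps
            jac[:, :, p, q] = ((ap @ ap.T) - (am @ am.T)) / (2 * eps * (n - 1))
    return jac
```

Each slice dC/da[p, q] only has nonzeros in row p and column q-related entries, which is why a scalar cost (like the mean) collapses all of this back to a single m x n gradient.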