Other than computing it numerically, is there a quick way to get the derivative of the covariance matrix (of my network activations)?
I'm trying to use it as a penalty term in my cost function in a deep neural network, but in order to back-propagate the error through my layers I need its derivative.
In MATLAB, if 'a' is the activation matrix (neurons x samples) of layer i and 'da' is the derivative of the activation function:
covariance = a * a' / (size(a,2)-1);
So far I've tried:
covarDelta = (da*a' + a*da' ) / (size(a,2)-1);
But strangely, I got much closer to the numerically calculated gradient when I took the derivative as if a*a' were actually a.^2 (which doesn't make sense, but it improved things a bit):
covarDelta = 2*a/size(a,1);
But neither of them is correct. Any idea how else to approximate the derivative of the covariance?
EDIT: I don't use the covariance matrix itself as the penalty term; I take the mean of all its elements and add that number to the cost function. I chose this approach because I wanted a penalty term that grows when there is more covariance overall between the signals.
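To make the setup concrete, here is a minimal NumPy sketch (not from the original post; it just mirrors the MATLAB lines above, assuming the same neurons x samples layout) of the penalty plus a finite-difference checker that any candidate analytic gradient can be compared against:

```python
import numpy as np

def covariance_penalty(a):
    # Mirror of the MATLAB code: covariance = a * a' / (size(a,2)-1),
    # and the penalty is the mean of all elements of that matrix.
    n = a.shape[1]
    c = a @ a.T / (n - 1)
    return c.mean()

def numerical_gradient(f, a, eps=1e-6):
    # Central finite differences, one input element at a time.
    g = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        a_plus, a_minus = a.copy(), a.copy()
        a_plus[idx] += eps
        a_minus[idx] -= eps
        g[idx] = (f(a_plus) - f(a_minus)) / (2 * eps)
    return g
```

Comparing `numerical_gradient(covariance_penalty, a)` against a proposed `covarDelta` is a quick way to rule candidates in or out.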
NOTE: I aim to minimize the similarity between the signals during training (I also tried penalizing pair-wise mutual information, but couldn't find a way to calculate its derivative either).
EDIT 2: I've finally used the same derivative provided by the accepted answer, but I changed the cost term to mean(sqrt(x.^2)), i.e. mean(abs(x)). This way both negative and positive covariance increase the penalty, and the derivative keeps the same form, just multiplied by the sign of each covariance entry.
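A sketch of this absolute-value variant (again NumPy rather than MATLAB, and the gradient formula is my own derivation via the chain rule, not from the original post) looks like this; note that mean(abs(C)) is not differentiable where an entry of C is exactly zero:

```python
import numpy as np

def abs_cov_penalty(a):
    # mean(sqrt(C.^2)) == mean(abs(C)): covariance of either sign is penalized.
    n = a.shape[1]
    c = a @ a.T / (n - 1)
    return np.abs(c).mean()

def abs_cov_penalty_grad(a):
    # Chain rule: dJ/dC = sign(C) / m^2, and since C = a*a'/(n-1) is symmetric,
    # the two product-rule terms combine into 2 * sign(C) @ a / (m^2 * (n-1)).
    # Valid wherever no entry of C is exactly zero (|x| has a kink at 0).
    m, n = a.shape
    c = a @ a.T / (n - 1)
    return 2.0 * np.sign(c) @ a / (m**2 * (n - 1))
```

Away from the kink, this matches a finite-difference check of `abs_cov_penalty`.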
Edit:
Suppose we only have one data point with three dimensions, a = [a1 a2 a3]'. Because the sum of all the elements of the outer product matrix a*a' is exactly the expansion of (a1+a2+a3)^2, the mean of the matrix is (a1+a2+a3)^2/(3*3). So in this case the derivative with respect to every dimension has the same value, 2*(a1+a2+a3)/(3*3).
For more data points that term becomes ((a1+a2+a3)^2 + (b1+b2+b3)^2 + ...)/(3*3), and the derivative for data point x = [x1 x2 x3]' is 2*(x1+x2+x3)/(3*3) (the same value for each dimension). The 1/(size(a,2)-1) factor in your covariance formula just scales all of this by a constant.
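The derivation above can be checked with a short NumPy sketch (my own, including the 1/(size(a,2)-1) factor from the question's covariance formula): the gradient of the mean of the covariance matrix is the same value, 2*(column sum)/(m^2*(n-1)), repeated down every row.

```python
import numpy as np

def mean_cov_grad(a):
    # Closed-form gradient of mean(a @ a.T / (n-1)) w.r.t. a:
    # every dimension of data point k gets the same value,
    # 2 * (sum of column k) / (m^2 * (n-1)).
    m, n = a.shape
    col_sums = a.sum(axis=0)  # x1 + x2 + ... + xm for each data point
    return np.tile(2.0 * col_sums / (m**2 * (n - 1)), (m, 1))
```

The result has the same shape as `a`, so it can be added directly into back-propagation as `covarDelta`.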
Simply taking the mean might not suit your needs, because the positive and negative values in the covariance matrix will cancel out.
Currently I don't have an environment to verify my answer, so please correct me where I'm wrong.
Original Post:
Normally people would use a scalar value as the cost, instead of a (covariance) matrix.
If we denote the covariance as a function cov(x), it takes a matrix as input and outputs a matrix. So the exact derivative is not a single matrix: the partial derivative of the output with respect to each element of the input matrix is itself a matrix. Say the dimension of the input matrix A is m*n; the dimension of the output matrix C is then m*m, and the derivative dC/dA is an m*m*m*n array. See http://mplab.ucsd.edu/tutorials/MatrixRecipes.pdf for details of matrix-by-matrix differentiation.
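As a concrete illustration (a NumPy sketch of my own, not from the tutorial linked above), the full Jacobian can be built numerically and its shape inspected, confirming that one m x m matrix of partials exists per input element:

```python
import numpy as np

def cov_jacobian(a, eps=1e-6):
    # Numerically build the Jacobian of C(a) = a @ a.T / (n-1) w.r.t. a:
    # jac[:, :, p, q] is the m x m matrix dC / da[p, q], so the whole
    # object has shape m x m x m x n.
    m, n = a.shape
    jac = np.zeros((m, m, m, n))
    for p in range(m):
        for q in range(n):
            ap, am = a.copy(), a.copy()
            ap[p, q] += eps
            am[p, q] -= eps
            jac[:, :, p, q] = ((ap @ ap.T) - (am @ am.T)) / (2 * eps * (n - 1))
    return jac
```

Each slice dC/da[p, q] only has nonzeros in row p and column q-related entries, which is why a scalar cost (like the mean) collapses all of this back to a single m x n gradient.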