matlab neural-network vectorization octave backpropagation

Why is this the correct way to do a cost function for a neural network?

So after beating my head against the wall for a few hours, I looked online for a solution to my problem, and it worked great. I just want to know what caused the issue with the way I was originally going about it.

here are some more details. The input is a 20x20px image from the MNIST datset, and there are 5000 samples, so X, or A1 is 5000x400. There are 25 nodes in the single hidden layer. The output is a one hot vector of 0-9 digits. y (not Y, which is the one hot encoding of y) is a 5000x1 vector with the value of 1-10.

Here was my original code for the cost function:

Y = zeros(m, num_labels);
   for i = 1:m
   Y(i, y(i)) = 1; 
endfor
H = sigmoid(Theta2*[ones(1,m);sigmoid(Theta1*[ones(m, 1) X]'))
J = (1/m) * sum(sum((-Y*log(H]))' - (1-Y)*log(1-H]))')))

But then I found this:

A1 = [ones(m, 1) X];
Z2 = A1 * Theta1';
A2 = [ones(size(Z2, 1), 1) sigmoid(Z2)];
Z3 = A2*Theta2';
H = A3 = sigmoid(Z3);

J = (1/m)*sum(sum((-Y).*log(H) - (1-Y).*log(1-H), 2));

I see that this may be slightly cleaner, but what functionally causes my original code to get 304.88 and the other to get ~ 0.25? Is it the element wise multiplication?

FYI, this is the same problem as this question if you need the formal equation written out.

Thanks for any help I can get! I really want to understand where I'm going wrong

Solution

Transfer from the comments:
With a quick look, in J = (1/m) * sum(sum((-Y*log(H]))' - (1-Y)*log(1-H]))'))) there is definetely something going on with the parenthesis, but probably on how you pasted it here, not with the original code as this would throw an error when you run it. If I understand correctly and Y, H are matrices, then in your 1st version Y*log(H) is matrix multiplication while in the 2nd version Y.*log(H) is an entrywise multiplication (not matrix-multiplication, just c(i,j)=a(i,j)*b(i,j) ).

Update 1:
In regards to your question in the comment. From the first screenshot, you represent each value yk(i) in the entry Y(i,k) of the Y matrix and each value h(x^(i))k as H(i,k). So basically, for each i,k you want to compute Y(i,k) log(H(i,k)) + (1-Y(i,k)) log(1-H(i,k)). You can do it for all the values together and store the result in matrix C. Then C = Y.*log(H) + (1-Y).*log(1-H) and each C(i,k) has the above mentioned value. This is an operation .* because you want to do the operation for each element (i,k) of each matrix (in contrast to multiplying the matrices which is totally different). Afterwards, to get the sum of all the values inside the 2D dimensional matrix C, you use the octave function sum twice: sum(sum(C)) to sum both columnwise and row-wise (or as @ Irreducible suggested, just sum(C(:))).

Note there may be other errors as well.