I know this has probably been addressed many a time, but I'm constantly hearing conflicting points and I'm uncertain as to how I should go about computing the loss function, and moreover, how to compute the gradient over a mini-batch.
Let's say I have a simple linear regression ANN model with one input, one output and no activation function. The weight matrix (W) and bias matrix (B) then just have the shape (1, 1). If I batch my data into mini-batches of size 32, then the input matrix (X) will have dimensions (1, 32). Forward prop is then performed without a hitch: it's just W.X + B which works fine because the shapes are compatible. It then produces the predictions which can be denoted by matrix Yi with shape (1, 32). The cost is then computed as the mean squared error of the model outputs and the truth values. In this case, there's only one output, so the cost over one training example is just (truth - predicted)^2.
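To make the shapes concrete, here is a minimal NumPy sketch of that forward pass and cost. The data, the "true" line 3x + 0.5, and the starting values of W and B are all made up for illustration; only the shapes come from the setup described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-batch: 32 examples, one feature each, stored column-wise.
X = rng.normal(size=(1, 32))           # inputs, shape (1, 32)
Y = 3.0 * X + 0.5                      # made-up truth values, shape (1, 32)

W = np.array([[0.1]])                  # weight matrix, shape (1, 1)
B = np.array([[0.0]])                  # bias matrix, shape (1, 1)

Y_pred = W @ X + B                     # forward pass; B broadcasts across the batch
per_example_cost = (Y - Y_pred) ** 2   # one squared error per example, shape (1, 32)
batch_cost = per_example_cost.mean()   # mean squared error over the batch (scalar)

print(Y_pred.shape, per_example_cost.shape, float(batch_cost))
```

Note that the (1, 1) bias is added to a (1, 32) result through broadcasting, which is why the shapes work out without tiling B by hand.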
So I'm confused about a couple of aspects at this point. Do you a) compute the average cost over a mini-batch, then compute the derivative of the averaged cost w.r.t. the weight and bias; or b) calculate individual costs for each example in the mini-batch, compute the derivatives of those costs w.r.t. the weight and bias, and then finally sum the gradients and average them?
Since the gradient is a linear operator, grad((cost(x1) + ... + cost(xn))/n) = (grad(cost(x1)) + ... + grad(cost(xn)))/n, so options a) and b) yield exactly the same gradient.
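If you want to convince yourself numerically, the following is a rough NumPy sketch that computes the gradient both ways for the one-input linear model with squared-error cost. The data and starting parameters are invented for the example, and the analytic derivatives used are d/dW (pred - truth)^2 = 2 (pred - truth) x and d/dB (pred - truth)^2 = 2 (pred - truth).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32                                  # mini-batch size

# Hypothetical data and parameters, for illustration only.
X = rng.normal(size=(1, n))             # inputs, shape (1, n)
Y = 3.0 * X + 0.5                       # truth values, shape (1, n)
W = np.array([[0.1]])                   # weight, shape (1, 1)
B = np.array([[0.0]])                   # bias, shape (1, 1)

err = W @ X + B - Y                     # predicted - truth, shape (1, n)

# (a) Differentiate the averaged cost mean(err^2) directly.
grad_W_a = (2.0 / n) * err @ X.T                        # shape (1, 1)
grad_B_a = (2.0 / n) * err.sum(axis=1, keepdims=True)   # shape (1, 1)

# (b) Differentiate each example's cost, then average the gradients.
per_example_grad_W = 2.0 * err * X      # one gradient per example, shape (1, n)
per_example_grad_B = 2.0 * err          # shape (1, n)
grad_W_b = per_example_grad_W.mean(axis=1, keepdims=True)
grad_B_b = per_example_grad_B.mean(axis=1, keepdims=True)

same = np.allclose(grad_W_a, grad_W_b) and np.allclose(grad_B_a, grad_B_b)
print(same)
```

Both routes should agree up to floating-point rounding. In practice (a) is what you would normally implement, since it is a single matrix product instead of n separate gradients.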