I have feature vectors of size 1x4098, and each feature vector corresponds to a float value (a temperature). For training I have 10,000 samples, so my training set is 10000x4098 and the labels are 10000x1. I want to use a linear regression model to predict temperature from the training data. I am using 3 hidden layers (512, 128, 32) with MSE loss. However, I only get 80% accuracy using TensorFlow. Could you suggest other loss functions to get better performance?
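Roughly, the setup is something like this (the activation functions and optimizer shown here are placeholders, not necessarily the ones I actually used):

```python
import tensorflow as tf

# Rough sketch of the setup: 4098-dim feature vectors, three hidden layers
# (512, 128, 32), a single linear output for temperature, MSE loss.
# Activations and optimizer are placeholders.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4098,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),  # linear output for regression
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
# model.fit(X_train, y_train, epochs=50, batch_size=64, validation_split=0.1)
```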
Let me give a rather theoretical explanation of the choice of loss function. As you may guess, it all depends on the data.
MSE has a nice probabilistic interpretation: it corresponds to the MLE (maximum likelihood estimate) under the assumption that the distribution $p(y \mid x)$ is Gaussian: $p(y \mid x) \sim \mathcal{N}(\mu, \sigma)$. Since the MLE converges to the true parameter value, under this assumption the minimum you find is very likely the best fit you can possibly get. Of course, you may end up in a local rather than the global minimum, and there is also the implicit assumption that your training data represent the $x$ distribution well. But this kind of uncertainty is inevitable, so realistically we just accept it.
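To make this concrete: with a fixed noise scale $\sigma$ (an extra assumption I'm adding here), the negative log-likelihood of the data is

$$
-\log \prod_{i=1}^{N} p(y_i \mid x_i)
= \sum_{i=1}^{N} \frac{(y_i - \hat{y}_i)^2}{2\sigma^2} + N \log\!\left(\sigma\sqrt{2\pi}\right),
$$

so minimizing it over the model's predictions $\hat{y}_i$ is exactly the same as minimizing $\sum_i (y_i - \hat{y}_i)^2$, i.e. the MSE.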
Moving on, minimizing the L1 loss (absolute difference) is equivalent to maximizing the likelihood under the assumption that $p(y \mid x)$ follows a Laplace distribution. And the same conclusion holds: if the data fits this distribution, no other loss will work better than the L1 loss.
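The same calculation applies, again with a fixed scale $b$ (my added assumption):

$$
-\log \prod_{i=1}^{N} p(y_i \mid x_i)
= \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{b} + N \log(2b),
$$

so the optimal predictions are exactly those that minimize the mean absolute error.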
Huber loss doesn't have a strict probabilistic interpretation (at least not one I'm aware of); it sits somewhere in between L1 and L2, closer to one or the other depending on the choice of $\delta$.
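For reference, the standard definition is quadratic (L2-like) for small residuals and linear (L1-like) for large ones:

$$
L_\delta(r) =
\begin{cases}
\tfrac{1}{2} r^2 & \text{if } |r| \le \delta, \\
\delta\left(|r| - \tfrac{1}{2}\delta\right) & \text{otherwise,}
\end{cases}
$$

where $r = y - \hat{y}$ is the residual.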
How does this help you find the right loss function? First of all, it means that no loss is superior to another by default. Secondly, the better you understand your data, the more confident you can be that your choice of loss function is correct. Of course, you can simply cross-validate all of these options and select the best one. But here is a good reason to do this kind of analysis anyway: when you are confident about the data distribution, you will see steady improvement as you add training data and increase model complexity. Otherwise, it is entirely possible that the model will never generalize.
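If you go the cross-validation route, a minimal sketch could look like the following (the architecture, activations, optimizer, and the placeholder data are assumptions for illustration; substitute your real 10000x4098 training set):

```python
import numpy as np
import tensorflow as tf

# Placeholder data with the shapes from the question; replace with the real
# training set and temperature labels.
X_train = np.random.randn(1000, 4098).astype("float32")
y_train = np.random.randn(1000, 1).astype("float32")
X_val = np.random.randn(200, 4098).astype("float32")
y_val = np.random.randn(200, 1).astype("float32")

def build_model(loss):
    # Same architecture as in the question; ReLU and Adam are assumptions.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(4098,)),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss=loss, metrics=["mae"])
    return model

losses = {
    "mse": "mse",                               # Gaussian assumption
    "mae": "mae",                               # Laplace assumption
    "huber": tf.keras.losses.Huber(delta=1.0),  # in between, controlled by delta
}

results = {}
for name, loss in losses.items():
    model = build_model(loss)
    history = model.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        epochs=50, batch_size=64, verbose=0)
    results[name] = min(history.history["val_mae"])

print(results)
```

Note that the three losses live on different scales, so the comparison is made on a common held-out metric (validation MAE here) rather than on the training loss values themselves.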