computer-science, estimation, entropy

Relation between entropy and information


In terms of compression and information theory, the entropy of a source is the average amount of information (in bits) that a symbol from the source conveys. Informally speaking, if we are certain about the outcome of an event, the entropy is low.
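To make that definition concrete, here is a minimal Python sketch (my own illustration, not part of the original question); the probability vectors are made-up examples. A near-certain outcome yields low entropy, while a uniform distribution over four symbols yields the maximum of log2(4) = 2 bits.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits: H(p) = -sum_i p_i * log2(p_i)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Near-certain outcome: very little information per symbol on average.
print(shannon_entropy([0.99, 0.01]))              # ~0.08 bits
# Uniform (maximally uncertain) source: entropy is maximal, log2(4) = 2 bits.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits
```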

J. Principe, D. Xu, and J. Fisher, “Information theoretic learning,” in Unsupervised Adaptive Filtering, S. Haykin, Ed. New York: Wiley, 2000, vol. I, pp. 265–319.

Entropy (both Shannon's and Rényi's) has been used in learning by minimizing the entropy of the error as an objective function, instead of the mean squared error.

My questions are:

  1. What is the rationale for minimizing the entropy of the error?
  2. When entropy is maximum, what can we say about the information?

Thank you.

Solution

  • This is probably a better fit for the Computer Science Stack Exchange, but as long as we have a computer science tag I am unwilling to downvote it. (Note: NOT the CS Theory Stack Exchange; that one is for research-grade discussions, which this is not, and they will downvote and close it immediately.)

    Anyway, the intuitive answer is almost exactly what you have said: as you minimize the entropy of something, you increase your ability to predict it. If you minimize the entropy of the error between the model and the observed results, you are increasing the predictive power of the model (the short numerical sketch after this answer illustrates the point).

    To sharpen this intuition mathematically, go forth and study things like the Expectation Maximization algorithm until you have internalized it. If you find EM hard going, then go forth and study things like Bayesian probability until EM makes sense.
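As a concrete companion to the answer above, here is a minimal sketch of the minimum-error-entropy idea, assuming the Parzen-window estimator of Rényi's quadratic entropy described in the information-theoretic-learning literature cited in the question; the kernel width `sigma` and the synthetic error samples are illustrative assumptions, not values from the original post. Errors that are tightly concentrated (a predictable model) give a lower estimated entropy than errors that are spread out, which is what an entropy-of-error objective rewards.

```python
import numpy as np

def renyi_quadratic_entropy(errors, sigma=0.5):
    """Parzen-window estimate of Renyi's quadratic entropy H2(e):
        H2(e) = -log( (1/N^2) * sum_ij G_{sigma*sqrt(2)}(e_i - e_j) )
    where G is a Gaussian kernel. Minimizing H2 of the error is the
    minimum-error-entropy (MEE) criterion."""
    e = np.asarray(errors, dtype=float)
    diff = e[:, None] - e[None, :]                  # all pairwise error differences
    s2 = 2.0 * sigma ** 2                           # variance of the pairwise kernel
    kernel = np.exp(-diff ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    information_potential = kernel.mean()           # V(e) = (1/N^2) * sum_ij G(e_i - e_j)
    return -np.log(information_potential)

rng = np.random.default_rng(0)
concentrated = rng.normal(0.0, 0.1, size=200)       # errors from a "good", predictable model
spread_out = rng.normal(0.0, 1.0, size=200)         # errors from a "poor" model
print(renyi_quadratic_entropy(concentrated))         # lower entropy: errors are predictable
print(renyi_quadratic_entropy(spread_out))           # higher entropy: errors carry more surprise
```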