I am currently using LogisticRegression from scikit-learn for a multi-class classification problem. I chose LogisticRegression because I have read a couple of articles describing it as a well-calibrated algorithm with respect to the prediction probabilities it returns.
For each result of the classifier I inspect its prediction probability, as well as the distance between the classified observation and the training examples that share the predicted class.
I'm surprised that for some results, even though a class is predicted with more than 90% confidence, the cosine similarity measure suggests that the example is, on average, nearly orthogonal to the training examples of that class.
Can someone please give me a clue as to why such a discrepancy could be observed?
I'd expect that for examples that are substantially distant from the rest of the observations in the same class, LogisticRegression would return low prediction probabilities.
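For reference, the check I am running looks roughly like the sketch below (the synthetic data and variable names are just placeholders to make it self-contained, not my actual dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

# Placeholder data; in my case X_train / X_test come from my own feature pipeline.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = clf.predict_proba(X_test)
pred = clf.classes_[proba.argmax(axis=1)]

for x, cls, p in zip(X_test, pred, proba.max(axis=1)):
    # Average cosine similarity between this example and all training
    # examples of the predicted class.
    same_class = X_train[y_train == cls]
    avg_sim = cosine_similarity(x.reshape(1, -1), same_class).mean()
    print(f"predicted class {cls} with p={p:.2f}, "
          f"mean cosine similarity to that class: {avg_sim:.2f}")
```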
Logistic regression / classification gives results with respect to a decision boundary, but there is no guarantee that points on the same side of the boundary (i.e., belonging to the same class) will have small cosine distances between them (or even small Euclidean distances).
Consider points in the x-y plane where all points below y = 0 belong to one class and all points above it belong to the other class. The points (-1000, 1) and (1000, 1) belong to the same class but have a very large cosine distance between them (they point in nearly opposite directions from the origin). On the other hand, the points (1000, 1) and (1000, -1) belong to different classes but have a very small cosine distance.
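Here is a small sketch of that toy geometry using scikit-learn; the synthetic training data is an assumption chosen so that the class is determined purely by the sign of the y-coordinate:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(0)

# Class label depends only on the sign of the y-coordinate.
X = rng.uniform(low=[-1000, -5], high=[1000, 5], size=(500, 2))
y = (X[:, 1] > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)

a = np.array([[-1000.0, 1.0]])   # same class as b (y > 0)
b = np.array([[1000.0, 1.0]])
c = np.array([[1000.0, -1.0]])   # opposite class to b (y < 0)

print(clf.predict_proba(b))      # b should be classified as the y > 0 class with high confidence
print(cosine_distances(a, b))    # same class, cosine distance close to 2 (nearly opposite directions)
print(cosine_distances(b, c))    # different classes, cosine distance close to 0
```

The classifier can be very confident about b because b sits far from the decision boundary, even though b is nearly anti-parallel (in the cosine sense) to other members of its own class such as a.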