This works as expected and returns 1 for one of the groups.
from sklearn import metrics
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [6, 6, 6, 1, 2, 2]
metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)  # returns (homogeneity, completeness, v_measure)
(1.0, 0.6853314789615865, 0.8132898335036762)
But the following returns about 0.75 for all 3 values, while I expected "1.0" for one of the groups, like in the example above.
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2,
2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0]
metrics.homogeneity_completeness_v_measure(y, labels)
(0.7514854021988339, 0.7649861514489816, 0.7581756800057786)
I expected 1 for one of the groups above!
Update:
As you can see, one of the predicted groups matches one of the true groups exactly, so one of the values should have been 1 instead of the 0.75 that I get for all 3 groups. This is not what I expected!
from collections import Counter
Counter(y)
Counter({0: 50, 1: 50, 2: 50})
Counter(labels)
Counter({1: 50, 0: 62, 2: 38})
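A quick way to check which cluster lines up with which true class is to look at the contingency table (a small sketch using sklearn's contingency_matrix helper; y and labels are the arrays defined above):

from sklearn.metrics.cluster import contingency_matrix

# Rows correspond to the true classes in y (0, 1, 2); columns correspond to
# the clusters that appear in labels. A row concentrated in a single column
# means that true class is captured by exactly one cluster.
contingency_matrix(y, labels)

This should show that the cluster labelled 1 contains only points from true class 0, while clusters 0 and 2 both mix points from true classes 1 and 2.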
Firstly, the homogeneity, completeness and V-measure scores are calculated as follows (these are the standard definitions used by scikit-learn):

h = 1 - H(C|K) / H(C)
c = 1 - H(K|C) / H(K)
v = 2 * h * c / (h + c)

C and K are two random variables. In your case, C is the true labels and K is the predicted labels.
If h = 1, it means that H(C|K) = 0, since H(C) is always greater than 0. H(C|K) = 0 means that the random variable C is completely determined by the random variable K; see the definition of conditional entropy for more detail.

So why is h = 1 in your first case? Because whenever I am given a value of the random variable K (the predicted label), I know what the random variable C (the true label) will be: if k is 6, c is 0; if k is 1, c is 1; and so on.

Why is h != 1 (and c != 1) in the second case? Because even though cluster 1 matches true class 0 perfectly, there is no such perfect match for the other clusters. If I am given k = 1, I know c is 0, but if I am given k = 0, I cannot tell whether c is 1 or 2. So the homogeneity score, and by the symmetric argument the completeness score, will not be 1.
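If it helps to see this concretely, here is a minimal sketch (not sklearn's internal implementation, but it follows the same h = 1 - H(C|K) / H(C) definition above) that recomputes the homogeneity score from the contingency table; manual_homogeneity is just a hypothetical helper name:

import numpy as np
from sklearn import metrics
from sklearn.metrics.cluster import contingency_matrix

def manual_homogeneity(labels_true, labels_pred):
    # Joint counts n_{c,k}: rows are true classes C, columns are clusters K.
    cm = contingency_matrix(labels_true, labels_pred).astype(float)
    n = cm.sum()
    p_ck = cm / n               # joint distribution P(C, K)
    p_c = p_ck.sum(axis=1)      # marginal P(C)
    p_k = p_ck.sum(axis=0)      # marginal P(K)

    # H(C): entropy of the true class labels.
    h_c = -np.sum(p_c * np.log(p_c))

    # H(C|K): how uncertain C still is once the cluster K is known.
    nz = p_ck > 0               # skip empty cells so log(0) never appears
    h_c_given_k = -np.sum(p_ck[nz] * np.log((p_ck / p_k)[nz]))

    return 1.0 - h_c_given_k / h_c

# In the second example H(C|K) > 0 because clusters 0 and 2 mix classes 1 and 2,
# so the homogeneity stays below 1; compare with sklearn's own score:
print(manual_homogeneity(y, labels))
print(metrics.homogeneity_score(y, labels))

This sketch ignores the degenerate case where H(C) = 0 (a single true class); it is only meant to make the entropy argument above concrete.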