cluster-analysis, precision, precision-recall

Recall and precision for multi-class clustering


I have difficulty understanding how to measure precision and recall for multi-class clustering. Here is an example with 9 elements:

Consider the following ground truth:

A,B,C,D
E,F,G
H,I

and the following observed clustering:

A,B,C
D
E,F,G,H,I

How should I calculate the number of true positives (TP), false positives (FP) and false negatives (FN)?

My naive approach has been to consider all pairs of elements:

TP = 7 (A-B, A-C, B-C, E-F, E-G, F-G, H-I)
FP = 6 (E-H, E-I, F-H, F-I, G-H, G-I)
FN = 3 (A-D, B-D, C-D)
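
For reference, the enumeration above can be written as a short Python sketch. The integer cluster ids and the function name `pair_counts` are only an illustrative encoding of this example, not part of any library:

```python
from itertools import combinations

# Cluster ids for the 9 elements (the ids themselves are arbitrary).
truth = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 1, "F": 1, "G": 1, "H": 2, "I": 2}
observed = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 2, "F": 2, "G": 2, "H": 2, "I": 2}


def pair_counts(truth, observed):
    """Count TP, FP, FN, TN over all unordered pairs of elements."""
    tp = fp = fn = tn = 0
    for x, y in combinations(sorted(truth), 2):
        same_truth = truth[x] == truth[y]
        same_observed = observed[x] == observed[y]
        if same_truth and same_observed:
            tp += 1  # together in both clusterings
        elif same_observed:
            fp += 1  # together only in the observed clustering
        elif same_truth:
            fn += 1  # together only in the ground truth
        else:
            tn += 1  # separated in both clusterings
    return tp, fp, fn, tn


print(pair_counts(truth, observed))  # (7, 6, 3, 20)
```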

Is this the right way to do it?

Thanks


Solution

  • Yes, TP etc. look good to me at first sight.

    But enumerating all pairs is slow.

    You can do better: you can directly compute the number of pairs from a cross tabulation matrix.

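    From the nonzero cells of the cross tabulation (3, 3 and 2; the cell that holds only D contributes no pair), there should be TP=3*2/2+3*2/2+2*1/2=7

    FP=3*2/2+5*4/2-TP=13-7=6 (same-cluster pairs in the observed clustering, minus TP)

    FN=4*3/2+3*2/2+2*1/2-TP=10-7=3 (same-cluster pairs in the ground truth, minus TP)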

    etc.

    That said, I would rather compute the Adjusted Rand Index (ARI), because you want a measure where a random result scores close to 0. With precision and recall, results tend to appear much better than they are.
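
A minimal sketch of the cross-tabulation approach, using only the Python standard library (Python 3.8+ for `math.comb`). It follows the question's definitions (FP are pairs joined only in the observed clustering, FN are pairs joined only in the ground truth). The function names and the integer label lists are assumptions made for illustration, and the ARI is computed with the standard pair-counting formula rather than taken from any particular library:

```python
from collections import Counter
from math import comb


def pair_counts_from_table(truth_labels, observed_labels):
    """Derive pair-level TP, FP, FN, TN from the cross tabulation."""
    n = len(truth_labels)
    # Cell (i, j) counts elements in truth cluster i and observed cluster j.
    cells = Counter(zip(truth_labels, observed_labels))
    tp = sum(comb(c, 2) for c in cells.values())
    # Same-cluster pairs within each clustering (row and column sums).
    same_truth = sum(comb(c, 2) for c in Counter(truth_labels).values())
    same_observed = sum(comb(c, 2) for c in Counter(observed_labels).values())
    fp = same_observed - tp
    fn = same_truth - tp
    tn = comb(n, 2) - tp - fp - fn
    return tp, fp, fn, tn


def adjusted_rand_index(tp, fp, fn, tn):
    """ARI from pair counts: (index - expected) / (maximum - expected)."""
    total = tp + fp + fn + tn
    same_truth, same_observed = tp + fn, tp + fp
    expected = same_truth * same_observed / total
    maximum = (same_truth + same_observed) / 2
    if maximum == expected:
        # Degenerate case (e.g. both clusterings trivial); 1.0 by convention.
        return 1.0
    return (tp - expected) / (maximum - expected)


# The example from the question: A..I in order.
truth_labels = [0, 0, 0, 0, 1, 1, 1, 2, 2]
observed_labels = [0, 0, 0, 1, 2, 2, 2, 2, 2]

tp, fp, fn, tn = pair_counts_from_table(truth_labels, observed_labels)
precision = tp / (tp + fp)  # 7 / 13
recall = tp / (tp + fn)     # 7 / 10
ari = adjusted_rand_index(tp, fp, fn, tn)
print(tp, fp, fn, tn)       # 7 6 3 20
print(round(precision, 3), round(recall, 3), round(ari, 3))  # 0.538 0.7 0.43
```

On this example the ARI (about 0.43) is noticeably lower than both precision and recall, which illustrates the point about chance correction.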