Search code examples
cluster-analysisdata-miningprecision-recall

Computing F-measure for clustering


Can anyone help me to calculate F-measure collectively ? I know how to calculate recall and precision, but don't know for a given algorithm how to calculate one F-measure value.

As an exemple, suppose my algorithm creates m clusters, but I know there are n clusters for the same data (as created by another benchmark algorithm).

I found one pdf but it is not useful since the collective value I got is greater than 1. Reference of pdf is F Measure explained. Specifically I have read some research paper, in which the author compares two algorithms on the basis of F-measure, they got collectively values between 0 and 1. if you read the pdf mentioned above carefully, the formula is F(C,K) = ∑ | ci | / N * max {F(ci,kj)}
where ci is reference cluster & kj is cluster created by other algorithm, here i is running from 1 to n & j is running from 1 to m.Let say |c1|=218 here as per pdf N=m*n let say m=12 and n=10, and we got max F(c1,kj) for j=2. Definitely F(c1,k2) is between 0 and 1. but the resultant value calculated by above formula we will get value above 1.


Solution

  • The paper Characterization and evaluation of similarity measures for pairs of clusterings by Darius Pfitzner, Richard Leibbrandt and David Powers contains a lot of useful information regarding this subject, including the following example:

    Given the set,

               D = {1, 2, 3, 4, 5, 6}
    

    and the partitions,

               P = {1, 2, 3}, {4, 5}, {6}, and
               Q = {1, 2, 4}, {3, 5, 6}
    

    where P is set created by our algorithm and Q is set created by standard algorithm we known

               PairsP = {(1, 2), (1, 3), (2, 3), (4, 5)},
               PairsQ = {(1, 2), (1, 4), (2, 4), (3, 5), (3, 6), (5, 6)}, and
               PairsD = {(1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 3), (2, 4),
                          (2, 5), (2, 6), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)}
    

    so,

               a = | PairsP intersection PairsQ | = |(1, 2)| = 1
               b = | PairsP- PairsQ | = |(1, 3)(2, 3)(4, 5)| = 3
               c = | PairsQ- PairsP  | = |(1, 4)(2, 4)(3, 5)(3, 6)(5, 6)| = 5
             
         F-measure= 2a/(2a+b+c)
    

    Note: There is an error in the publication on page 364 where a, b, c, and d are computed and the result of b and c are actually switched incorrectly. This switch would throw off the results of some other measures. Obviously, the F-measure is unaffected.