Search code examples
cluster-analysisprecision-recall

How to compute precision and recall in clustering?


I am really confused how to compute precision and recall in clustering applications.

I have the following situation:

Given two sets A and B. By using a unique key for each element I can determine which of the elements of A and B match. I want to cluster those elements based on features (not using the unique key of course).

I am doing the clustering but I am not sure how to compute precision and recall. The formulas,according to the paper "Extended Performance Graphs for Cluster Retrieval" (http://staff.science.uva.nl/~nicu/publications/CVPR01_nies.pdf) are:

p = precision = relevant retrieved items/retrieved items and r = recall = relevant retrieved items/relevant items

I really do not get what elements fall under which category.

What I did so far is, I checked within the clusters how many matching pairs I have (using the unique key). Is that already one of precision or recall? And if so, which one is it and how can I compute the other one?

Update: I just found another paper with the title "An F-Measure for Evaluation of Unsupervised Clustering with Non-Determined Number of Clusters" at http://mtg.upf.edu/files/publications/unsuperf.pdf.


Solution

  • I think you'll find wikipedia has a helpful article on precision and recall. In short:

    Precision = true positives / (true positives + false positives)

    Recall = true positives /( true positivies + false negatives)