machine-learning, set, similarity, metrics, precision-recall

What is a metric to measure the similarity of any two sets?


I am looking for a function that assigns a real number to any two sets based on their elements alone. It should reward a large intersection between the sets but penalize sets that contain extraneous items. In other words, I want a single metric that accounts for both recall and precision.


Solution

  • What you are looking for is the Jaccard index:

    J(A, B) := |A ∩ B| / |A ∪ B|
    

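    In other words, it counts the elements the two sets have in common and divides that count by the number of distinct elements in their union. Extraneous elements in either set enlarge the union without enlarging the intersection, so they lower the score.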

    J(A, B) reaches its maximum of 1 when both sets are identical and its minimum of 0 when they share no elements. When both sets are empty the formula becomes 0/0 and is undefined, so you may want to assign a value for that case explicitly.
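
A minimal Python sketch of the Jaccard index, using only built-in sets. The function name `jaccard` and the `empty_value` parameter are illustrative assumptions, not part of the answer above; returning 1.0 when both sets are empty is just one possible convention for the undefined 0/0 case.

```python
def jaccard(a, b, empty_value=1.0):
    """Return |A ∩ B| / |A ∪ B| for two iterables of hashable items.

    `empty_value` is an assumed convention for the case where both
    sets are empty and the ratio would be 0/0.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets are empty: the index is undefined, so fall back
        # to the caller-supplied convention.
        return empty_value
    return len(a & b) / len(union)


# Two shared elements out of four distinct elements overall.
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```

If A is a predicted set and B a reference set, the intersection corresponds to the true positives and the union to true positives plus false positives plus false negatives, so the score drops both when items are missed and when extraneous items are included.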