java groovy comparison similarity

Similarity between Sets containing Integers in Java or Groovy


I have two HashSet<Integer> instances, A and B, and I want to compare them to get a numeric value for how similar they are (e.g. 0.9 if 90% of the elements of A and B are the same). What is the best (fastest) way to do this in Java or Groovy?

My naive approach is to collect the elements common to A and B and divide their count by the original size of A. Is there any reason (e.g. speed) why this wouldn't work properly? Generally speaking, I would prefer an already implemented way to get the similarity.

Note: Comparing {1, 2} to {12} should give 0% similarity.


Solution

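  • As Adam suggests, a loop is the most efficient way to find the size of the intersection: iterate over the smaller set and check membership in the larger one, which takes expected O(min(|A|, |B|)) time for HashSets and allocates no intermediate set.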

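    // Counts the elements the two sets have in common.
    public static int intersectionsCount(Set<?> set1, Set<?> set2) {
        // Always iterate the smaller set so the loop does as few lookups as possible.
        if (set2.size() < set1.size()) return intersectionsCount(set2, set1);
        int count = 0;
        for (Object o : set1)
            if (set2.contains(o)) count++;
        return count;
    }

    // Ratio of shared elements to all distinct elements (the Jaccard index).
    public static double commonRatio(Set<?> set1, Set<?> set2) {
        int common = intersectionsCount(set1, set2);
        int union = set1.size() + set2.size() - common;
        // Assumption: two empty sets count as identical; without this check 0 / 0 yields NaN.
        if (union == 0) return 1.0;
        return (double) common / union; // [0.0, 1.0]
    }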
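
For anyone who wants to try this out, here is a minimal runnable sketch, assuming Java 9+ (for `Set.of`). The class name `SetSimilarity` and the method names `jaccard`, `overlapRelativeToA` and `intersectionCountWithRetainAll` are made up for illustration. It shows the union-based ratio next to the "divide by the size of A" ratio from the question, plus a variant built on the standard `retainAll` method for those who prefer an already implemented intersection. Treating two empty sets as identical is an assumption, not something the question specifies.

```java
import java.util.HashSet;
import java.util.Set;

public class SetSimilarity {

    // Counts shared elements by iterating the smaller set and probing the larger one.
    public static <T> int intersectionCount(Set<T> a, Set<T> b) {
        Set<T> small = a.size() <= b.size() ? a : b;
        Set<T> large = (small == a) ? b : a;
        int count = 0;
        for (T item : small) {
            if (large.contains(item)) count++;
        }
        return count;
    }

    // Union-based ratio (Jaccard index): shared elements / all distinct elements.
    // Assumption: two empty sets are treated as identical (1.0) instead of returning NaN.
    public static <T> double jaccard(Set<T> a, Set<T> b) {
        int common = intersectionCount(a, b);
        int union = a.size() + b.size() - common;
        return union == 0 ? 1.0 : (double) common / union;
    }

    // The ratio described in the question: shared elements relative to the size of A.
    // Assumption: an empty A gives 0.0.
    public static <T> double overlapRelativeToA(Set<T> a, Set<T> b) {
        return a.isEmpty() ? 0.0 : (double) intersectionCount(a, b) / a.size();
    }

    // Library-based variant: copy A, then keep only the elements that are also in B.
    // Simpler to read, but it allocates a temporary set.
    public static <T> int intersectionCountWithRetainAll(Set<T> a, Set<T> b) {
        Set<T> copy = new HashSet<>(a);
        copy.retainAll(b);
        return copy.size();
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>(Set.of(1, 2, 3, 4));
        Set<Integer> b = new HashSet<>(Set.of(3, 4, 5));

        System.out.println(jaccard(a, b));                        // 2 shared / 5 distinct
        System.out.println(overlapRelativeToA(a, b));             // 2 shared / 4 in A
        System.out.println(intersectionCountWithRetainAll(a, b)); // 2
        System.out.println(jaccard(Set.of(1, 2), Set.of(12)));    // no shared elements: 0.0
    }
}
```

Note that the two ratios answer slightly different questions: the union-based one is symmetric, while dividing by the size of A changes if you swap A and B.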