math, similarity

How to calculate the similarity of numbers in a list


I am looking for a method for calculating a similarity score for a list of numbers. Ideally the method should give a result in a fixed range, for example from 0 to 1, where 0 means the numbers are not similar at all and 1 means all numbers are identical.

For clarity let me provide a few examples:

0 1 2 3 4 5 6 7 8 9 10 => the similarity should be 0 or close to zero as all numbers are different
1 1 1 1 1 1 1 => 1
10 9 11 10.5 => close to 1
1 1 1 1 1 1 1 1 1 1 100 => score should still be pretty high, as only the last value is different

I have tried to calculate the similarity based on normalization and the average, but that gives me really bad results when there is one 'bad number'.

Thank you.


Solution

  • Similarity measures are always very subjective, and the right one to use depends heavily on what you're trying to use it for. We already have three typical measures of central tendency (mean, median, mode). It's hard to say which measure will work for you, because there are different ways of measuring that will do what you're asking for on your examples but give wildly different scores for other lists (like [1]*7 + [100] * 7). Here's one solution:

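    import statistics as stats

    def tester(ell):
        # Share of repeated values: 0 when every value is distinct.
        mode_measure = 1 - len(set(ell))/len(ell)
        # 1 minus the coefficient of variation (stdev relative to the mean).
        # Needs at least two values and a nonzero mean; abs() handles negative means.
        avg_measure = 1 - stats.stdev(ell)/abs(stats.mean(ell))
        # Keep the more favourable score; mode_measure >= 0, so the result is never negative.
        return max(avg_measure, mode_measure)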