java algorithm machine-learning data-mining entropy

Real world Algorithm - Measuring uniqueness of input values

I have a list of key-value pairs. For each key, I want to measure how unique its values are. For example, for a particular key k1, all the values might be the same (best case). For a key k2, half of the values might be one value and the other half another, and so on, up to a key kx for which none of the values match (worst case).

I want to give a rank (or a percentage, whatever) to each of these keys based on the above and produce a final ordering, so that I can filter out the keys that have many different values (let's say above a predefined threshold rank or percentage).

I think this is related to some concepts I learned in my data mining course, but I cannot recall which ones.

Thanks.

Solution

You could perhaps use some Information Theory for this.

For each key, you could compute the entropy of the values. The higher the entropy, the more diverse the key's values are. You could use that to rank the keys.
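As a rough illustration of that idea, here is a minimal Java sketch. It assumes the values are strings and that the data is already grouped as a map from key to list of values. The class and method names (ValueEntropy, entropy, normalizedEntropy, keysAtOrBelow) are made up for this example. Dividing by log2(n) to get a score between 0 and 1 is just one possible way to turn the entropy into the "percentage" the question asks for, not the only one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; names are illustrative only.
class ValueEntropy {

    // Shannon entropy in bits: H = -sum(p_i * log2(p_i)),
    // where p_i is the fraction of values equal to the i-th distinct value.
    static double entropy(List<String> values) {
        if (values.isEmpty()) {
            return 0.0;
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        double n = values.size();
        double h = 0.0;
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Assumption: scale by the maximum possible entropy, log2(n), so that
    // 0 means "all values identical" and 1 means "all values distinct".
    static double normalizedEntropy(List<String> values) {
        int n = values.size();
        if (n < 2) {
            return 0.0;
        }
        return entropy(values) / (Math.log(n) / Math.log(2));
    }

    // Keeps the keys whose score is at or below the threshold,
    // ordered from least diverse to most diverse.
    static List<String> keysAtOrBelow(Map<String, List<String>> data, double threshold) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : data.entrySet()) {
            if (normalizedEntropy(e.getValue()) <= threshold) {
                kept.add(e.getKey());
            }
        }
        kept.sort(Comparator.comparingDouble(k -> normalizedEntropy(data.get(k))));
        return kept;
    }
}
```

With four values per key, a key whose values are all the same scores 0, a key split half and half between two values scores 0.5, and a key with four distinct values scores 1. Calling keysAtOrBelow(data, 0.5) would then keep the first two keys and drop the third.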

The following article discusses some related topics: Calculating Entropy for Data Mining.