
Normalize heavily skewed array of numbers?


I get an array of hashtags back from a web service, that I'm using to build a tag cloud. My issue is with assigning font weights to the tags, because the most popular tag is soooo popular compared to the remaining tags. I get something like this:

total count: 17000
tag1 count: 15000
tag2 count: 800
tag3 count: 150

etc.

If I assign size by percentage, I get one huge font and a bunch of minimum-size fonts. That is true to scale, but it doesn't look right. If I distribute font sizes evenly, by just dividing the max font size by the number of tags, then I lose the disparity that really shows tag popularity.

Looking for a happy medium where I can easily see tag1's popularity but not have the rest too small to even see.

Hope this makes sense.


Solution

  • Using log(count) should do the job. A logarithm increases by one each time its input increases by one order of magnitude. With base 10, this means that log(100) = 2, log(1000) = 3, log(1000000) = 6, etc.

    Another way of putting it is that logarithms are the inverse of exponentiation.

    But Khan probably does a better job explaining it than I do :) https://www.khanacademy.org/math/algebra2/logarithms-tutorial/logarithmic-scale-patterns/v/logarithmic-scale
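
The log(count) idea above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: the function name `log_scaled_sizes`, the 12 to 48 font-size range, and the min/max rescaling step are all assumptions added here, and the counts are the three sample values from the question.

```python
import math

def log_scaled_sizes(counts, min_size=12.0, max_size=48.0):
    """Map tag counts to font sizes on a base-10 logarithmic scale.

    Assumes every count is >= 1 (math.log10 raises ValueError for 0 or negatives).
    min_size and max_size are hypothetical defaults, not values from the question.
    """
    logs = {tag: math.log10(count) for tag, count in counts.items()}
    lo, hi = min(logs.values()), max(logs.values())
    span = hi - lo
    sizes = {}
    for tag, value in logs.items():
        # Position of this tag between the smallest and largest log(count);
        # if all counts are equal, fall back to the middle of the range.
        fraction = (value - lo) / span if span else 0.5
        sizes[tag] = min_size + fraction * (max_size - min_size)
    return sizes

# Sample counts taken from the question.
counts = {"tag1": 15000, "tag2": 800, "tag3": 150}
sizes = log_scaled_sizes(counts)
for tag, size in sizes.items():
    print(f"{tag}: {size:.1f}")
```

With these sample counts, tag1 still gets the largest size and tag3 the smallest, but tag2 lands at roughly 25 instead of sitting just above the minimum as it would with linear scaling.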