python matplotlib logarithm

Matplotlib logarithmic scale with zero value


I have a very large, sparse dataset of spam Twitter accounts, and I need to scale the x axis to visualize the distribution (histogram, KDE, etc.) and CDF of various variables (tweets_count, number of followers/following, etc.).

    > describe(spammers_class1$tweets_count)
      var       n   mean      sd median trimmed mad min    max  range  skew kurtosis   se
    1   1 1076817 443.47 3729.05     35   57.29  43   0 669873 669873 53.23  5974.73 3.59

In this dataset, the value 0 is very important (in fact, 0 should have the highest density). However, on a logarithmic scale these values are dropped. I thought of changing the value to 0.1, for example, but it would make no sense for spam accounts to have 10^-1 followers.

So, what would be a workaround in Python and matplotlib?


Solution

  • Add 1 to each x value, then take the log:

    import matplotlib.pyplot as plt
    import numpy as np
    import matplotlib.ticker as ticker
    
    fig, ax = plt.subplots()
    x = [0, 10, 100, 1000]
    y = [100, 20, 10, 50]
    # Shift every x by 1 so that x=0 maps to log(1) = 0 instead of log(0)
    x = np.asarray(x) + 1 
    y = np.asarray(y)
    ax.plot(x, y)
    ax.set_xscale('log')
    ax.set_xlim(x.min(), x.max())
    # Label each tick with the original (unshifted) value
    ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{0:g}'.format(x-1)))
    # Put the major ticks exactly at the shifted data points
    ax.xaxis.set_major_locator(ticker.FixedLocator(x))
    plt.show()
    



    Use

    ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{0:g}'.format(x-1)))
    ax.xaxis.set_major_locator(ticker.FixedLocator(x))
    

    to relabel the tick marks according to the non-log values of x.

    (My original suggestion was to use plt.xticks(x, x-1), but this would affect all axes. To isolate the changes to one particular axes, I changed all the plt calls to method calls on ax.)


    Matplotlib removes points that contain a NaN, inf or -inf value. Since log(0) is -inf, the point corresponding to x=0 would be removed from a log plot.
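
    To see why the zero disappears, here is a minimal sketch using NumPy directly. The array values are illustrative only, not taken from the dataset in the question. The log of 0 evaluates to -inf, while the values shifted by 1 are all finite.

```python
import numpy as np

x = np.array([0.0, 10.0, 100.0, 1000.0])  # illustrative values only

# log10(0) evaluates to -inf; silence the divide-by-zero warning for this demo
with np.errstate(divide="ignore"):
    raw = np.log10(x)

# After adding 1, x=0 becomes log10(1) = 0, which is finite
shifted = np.log10(x + 1)

print(np.isfinite(raw))      # the first entry is not finite
print(np.isfinite(shifted))  # every entry is finite
```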

    If you increase all the x-values by 1, then since log(1) = 0, the point corresponding to x=0 will now be plotted at x=log(1)=0 on the log plot.

    The remaining x-values will also be shifted by one, but it will not matter to the eye since log(x+1) is very close to log(x) for large values of x.
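
    A quick numerical check of that claim, as a sketch using only the standard library. The helper name `log_shift_error` is made up for illustration, and base 10 is assumed because that is what a default log axis shows.

```python
import math


def log_shift_error(x):
    """Return log10(x + 1) - log10(x) for x > 0 (hypothetical helper)."""
    return math.log10(x + 1) - math.log10(x)


# The gap between the shifted and unshifted positions shrinks as x grows
for x in (1, 10, 100, 1000, 100000):
    print(f"x={x:>6}: log10(x+1) - log10(x) = {log_shift_error(x):.6f}")
```

    The gap is largest for small x (it equals log10(2), about 0.30 of a decade, at x=1) and falls below 0.001 of a decade by x=1000, so the distortion is confined to the left edge of the axis.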