
Can I mimic a log scale of an axis in matplotlib without transforming the associated data?


I am trying to display a Zipf plot, which is typically displayed on a log-log scale.

I'm using a library which gives rank in linear scale and frequencies in log scale. I have the following code which plots my data fairly correctly:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

ranks = [3541, 60219, 172644, 108926, 733215, 1297533, 1297534, 1297535]
# These frequencies are already log-scale
freqs = [-10.932271003723145, -15.213129043579102, -17.091760635375977, -16.27560806274414, 
        -19.482173919677734, -19.502029418945312, -19.502029418945312, -19.502029418945312]

data = {
    'ranks': ranks, 
    'freqs': freqs,
}

df = pd.DataFrame(data=data)

_, ax = plt.subplots(figsize=(7, 7))
ax.set(xscale="log", yscale="linear")
ax.set_title("Zipf plot")
sns.regplot(x="ranks", y="freqs", data=df, ax=ax, fit_reg=False)
ax.set_xlabel("Frequency rank of token")
ax.set_ylabel("Absolute frequency of token")
ax.grid(True, which="both")
plt.show()

The resulting plot is:

[Figure: Zipf 1]

The plot looks good, but the y-axis tick labels are odd: they show the raw log values. I'd like them to be displayed in log increments as well. My current workaround is to raise 10 to the power of each element in the freqs list; i.e.,

freqs = [10**freq for freq in freqs]
# ...

and change the yscale in ax.set to log; i.e.,

_, ax = plt.subplots(figsize=(7, 7))
ax.set(xscale="log", yscale="log")
ax.set_title("Zipf plot")
# ...

This gives me the expected plot (below), but it requires a transform of the data which is a) relatively expensive, b) redundant, c) lossy.

[Figure: Zipf 2]

Is there a way to mimic the log scale of the axes in a matplotlib plot without transforming the data?


Solution

  • First, a comment: personally, I would prefer rescaling the data, since it makes everything much easier at the expense of some more memory and CPU time, and accuracy should not matter.

    Now to the question, which is actually how to mimic a log scale on a linear axis.
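
To put a number on the accuracy concern, here is a minimal sketch (not part of the original code) that applies the 10**freq workaround from the question and then takes the log again, so the round-trip error can be inspected. It assumes ordinary Python floats (IEEE double precision); the names linear, round_trip and max_error are only illustrative.

```python
import math

# Log10 frequencies copied from the question.
freqs = [-10.932271003723145, -15.213129043579102, -17.091760635375977,
         -16.27560806274414, -19.482173919677734, -19.502029418945312,
         -19.502029418945312, -19.502029418945312]

# Undo the log, which is what yscale="log" needs ...
linear = [10.0 ** f for f in freqs]
# ... and take the log again to measure what the round trip lost.
round_trip = [math.log10(v) for v in linear]
max_error = max(abs(a - b) for a, b in zip(freqs, round_trip))

print(min(linear))  # smallest value that would be handed to the log axis
print(max_error)    # worst-case absolute round-trip error
```

Values around 1e-20 are far above the smallest normal double (about 2.2e-308), so underflow is not an issue for data in this range.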

    Solution 1: mimic the log scale

    This is not easy. Setting the axes to log scale changes a lot in the background and one needs to mimic all of that.

    • The easy part is to set the major tick spacing to 1 by using matplotlib.ticker.MultipleLocator().
    • Creating the minor ticks at positions which look logarithmic is harder. The best solution I could come up with is to set them manually using matplotlib.ticker.FixedLocator().
    • Last, we need to change the tick labels to represent the actual numbers, meaning that they should look like 10^(-x) instead of -x. I am aware of two options here:
      • Using a FuncFormatter that shows the values 10**x in scientific format.
      • Using a FuncFormatter that shows the values 10^x in LaTeX-style MathText. This looks much nicer but does not match the rest of the plot.

    I do not know any better solution for that last point, but maybe someone else does.
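
As an illustration of the last two points, the tick positions and labels can be computed with plain functions before they are handed to matplotlib. This is only a sketch: log_minor_ticks and power_of_ten_label are hypothetical helper names, not matplotlib API, and the decade range is passed in by hand rather than derived from the data.

```python
import math

def log_minor_ticks(lo_exp, hi_exp):
    """Minor tick positions for a linear axis that holds log10 values.

    For every decade n in [lo_exp, hi_exp) the ticks sit at n + log10(k),
    k = 2..9, which is where 2*10**n ... 9*10**n fall on a log axis.
    """
    return [n + math.log10(k)
            for n in range(lo_exp, hi_exp)
            for k in range(2, 10)]

def power_of_ten_label(x, pos=None):
    """Label a major tick at log10 value x as 10^x in MathText."""
    exponent = int(round(x))  # round first to guard against float noise
    return r"$10^{%d}$" % exponent

ticks = log_minor_ticks(-20, -10)
print(len(ticks), ticks[:3])
print(power_of_ten_label(-11.0))
```

The returned list can be passed to FixedLocator(...), and power_of_ten_label has the (x, pos) signature that FuncFormatter expects.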

    Here is the code and how it looks.

    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd
    import numpy as np
    from matplotlib.ticker import MultipleLocator, FixedLocator, FuncFormatter
    
    ###### Locators for the y-axis
    # set major ticks at multiples of 1
    majorLocator = MultipleLocator(1.)
    # create custom minor ticks at logarithmic positions: log10(i) - (n + 1)
    ra = np.array([-(n + 1. - np.log10(i)) for n in range(10, 20) for i in range(9, 1, -1)])
    minorLocator = FixedLocator(ra)
    ###### Formatter for the y-axis (choose one of the following two)
    # show labels as powers of 10 in scientific notation (looks ugly)
    # majorFormatter = FuncFormatter(lambda x, p: "{:.1e}".format(10**x))
    # or use MathText (looks nice, but does not match the rest of the layout)
    majorFormatter = FuncFormatter(lambda x, p: r"$10^{" + "{:d}".format(int(round(x))) + r"}$")
    
    ranks = [3541, 60219, 172644, 108926, 733215, 1297533, 1297534, 1297535]
    # These frequencies are already log-scale
    freqs = [-10.932271003723145, -15.213129043579102, -17.091760635375977, -16.27560806274414, 
            -19.482173919677734, -19.502029418945312, -19.502029418945312, -19.502029418945312]
    
    data = {
        'ranks': ranks, 
        'freqs': freqs,
    }
    
    df = pd.DataFrame(data=data)
    
    _, ax = plt.subplots(figsize=(6, 6))
    ax.set(xscale="log", yscale="linear")
    ax.set_title("Zipf plot")
    
    sns.regplot(x="ranks", y="freqs", data=df, ax=ax, fit_reg=False)
    
    # Set the locators
    ax.yaxis.set_major_locator(majorLocator)
    ax.yaxis.set_minor_locator(minorLocator)
    # Set the formatter if you want the tick labels consistently in power notation
    ax.yaxis.set_major_formatter(majorFormatter)
    
    ax.set_xlabel("Frequency rank of token")
    ax.set_ylabel("Absolute frequency of token")
    ax.grid(True, which="both")
    plt.show()
    

    [Figure: resulting plot]

    Solution 2: Use a different axes

    A different solution, which I did not think of at first, would be to use two different axes: one with a log-log scale, which looks nice and produces the correct labels and ticks, and another one to plot the data on.

    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd
    import numpy as np
    
    ranks = [3541, 60219, 172644, 108926, 733215, 1297533, 1297534, 1297535]
    # These frequencies are already log-scale
    freqs = [-10.932271003723145, -15.213129043579102, -17.091760635375977, -16.27560806274414, 
            -19.482173919677734, -19.502029418945312, -19.502029418945312, -19.502029418945312]
    
    data = {
        'ranks': ranks, 
        'freqs': freqs,
    }
    
    df = pd.DataFrame(data=data)
    
    fig, ax = plt.subplots(figsize=(6, 6))
    # use two axes:
    # ax is the log-log axes, which produces nice labels and ticks
    ax.set(xscale="log", yscale="log")
    ax.set_title("Zipf plot")
    # ax2 is the axes the values are plotted on
    ax2 = ax.twinx()
    
    # plot the values on ax2
    sns.regplot(x="ranks", y="freqs", data=df, ax=ax2, fit_reg=False)
    # set the y-limits of the log-log axes to 10 to the power of the limits of ax2
    ax.set_ylim(10**np.array(ax2.get_ylim()))
    
    
    ax.set_xlabel("Frequency rank of token")
    ax.set_ylabel("Absolute frequency of token")
    # remove tick labels and axis label from ax2
    ax2.set_yticklabels([])
    ax2.set_ylabel("")
    ax.grid(True, which="both")
    plt.show()
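
The code above copies the limits from ax2 to ax only once, so the two axes can drift apart if the limits of ax2 are changed later, for example by another set_ylim call. A possible way around that, sketched here under the assumption that the Axes callback signal "ylim_changed" is used for the update, is to re-apply the 10**limit mapping whenever ax2 changes. sync_log_axis is a hypothetical helper name, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

def sync_log_axis(log_ax, lin_ax):
    """Keep the y-limits of log_ax at 10**(y-limits of lin_ax)."""
    def _update(changed_ax):
        lo, hi = changed_ax.get_ylim()
        log_ax.set_ylim(10.0 ** lo, 10.0 ** hi)
    # Called by matplotlib every time the y-limits of lin_ax change.
    lin_ax.callbacks.connect("ylim_changed", _update)
    _update(lin_ax)  # apply once for the current limits

fig, ax = plt.subplots()
ax.set_yscale("log")   # decorative axes that shows the log ticks
ax2 = ax.twinx()       # axes that holds the log10 data
sync_log_axis(ax, ax2)

ax2.set_ylim(-20, -10)  # a later change is picked up automatically
print(ax.get_ylim())
```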
    

    [Figure: resulting plot]