Search code examples
pythonnumpymatplotlibdata-visualizationfloating-accuracy

is log1p the "correct" way of doing log scale transformation of charts?


when transforming data to log scale for charting purposes, is it more "correct" in some way to always transform using np.log1p than with np.log and does it break any common user expectations?

I'm building a charting software with log scale capabilities, and wonder if I should use np.log or np.log1p as the default choice when transforming the data.

here's a vastly simplified code sample:

import matplotlib.pyplot as plt
def chart_with_log_scale(x,y):
  ylog = np.log(y) # should I be using np.log1p here instead?
  plt.scatter(x,ylog )

or put a different perspective on this, does matplotlib use log1p or log when it does its log transform in code such as this?

def chart_with_log_scale2(x,y):
  plt.scatter(x,y)
  ax = plt.gca()
  ax.set_yscale("log")

Solution

  • when transforming data to log scale for charting purposes, is it more "correct" in some way to always transform using np.log1p than with np.log and does it break any common user expectations?

    It is almost never correct to use np.log1p instead of np.log if your goal is to compute log(๐‘ฅ).

    Here's an example of a plot with the y axis in log scale, of the probability density function for the Beta distribution with ๐›ผ = 2 and ๐›ฝ = 5:

    log PDF of Beta(2,5)

    Here's the same function with the y axis in log1p scale instead:

    log 1 + PDF of Beta(2,5)

    If I tried to pass this off as a log scale plot of the Beta(2,5) PDF as a grad student, my advisor would probably shoot me dead on the spot.

    (Exception: If your inputs are always greater than 253 on a machine with IEEE 754 binary64 arithmetic, then the two will functions will most likely coincide. But that is only because log(1 + ๐‘ฅ) has such low relative error from log(๐‘ฅ) on such inputsโ€”that is, |log(1 + ๐‘ฅ) โˆ’ log(๐‘ฅ)|/|log(๐‘ฅ)| = |log(๐‘ฅโ‹…(1/๐‘ฅ + 1)) โˆ’ log(๐‘ฅ)|/log(๐‘ฅ) = log(1 + 1/๐‘ฅ)/log(๐‘ฅ) < 1/๐‘ฅ < 2โˆ’53 so log(1 + ๐‘ฅ) is at worst a rounding error away from log(๐‘ฅ).)


    In a comment, you asked:

    log1p could be what I want if the values are very close to 0, as it will have better numerical stability than log, right?

    The functions log1p and log are just mathematical functions. Neither one has โ€œbetter numerical stabilityโ€ than the other: โ€œnumerical stabilityโ€ is not even a well-defined concept, and certainly not of a mathematical function. An algorithm for computing a mathematical function can exhibit forward or backward stability; what this property means is relative to the function it aims to compute. But log and log1p are simply mathematical functions, not algorithms for computing functions, and as such, forward and backward stability do not apply.

    The importance of log1p is that the function log(1 + ๐‘ฅ) is well-conditioned near zero, and often turns up in numerical algorithms or algebraic rearrangements of other functions. Well-conditioned means that if you evaluate it at a point ๐‘ฅโ‹…(1 + ๐›ฟ) when you actually wanted to evaluate it at ๐‘ฅ, then the answer log(1 + ๐‘ฅโ‹…(1 + ๐›ฟ)) is equal to log(1 + ๐‘ฅ)โ‹…(1 + ๐œ€) where ๐œ€ is a reasonably small multiple of ๐›ฟ as long as ๐›ฟ is reasonably small. Here ๐›ฟ is the relative error of the input ๐‘ฅโ‹…(1 + ๐›ฟ) from ๐‘ฅ, and ๐œ€ is the relative error of the output log(1 + ๐‘ฅ)โ‹…(1 + ๐œ€) from log(๐‘ฅ).

    In contrast, the function log(๐‘ฆ) is ill-conditioned near 1: evaluate log(๐‘ฆโ‹…(1 + ๐›ฟ)) when you want log(๐‘ฆ) for some point ๐‘ฆ near 1, and what you get back may be log(๐‘ฆ)โ‹…(1 + ๐œ€) for an arbitrarily bad error ๐œ€, even if the input error ๐›ฟ was quite small. For example, suppose you want to evaluate log(1.000000000000001) โ‰ˆ 9.999999999999995โ€‰ร—โ€‰10โˆ’16. If you write np.log(1.000000000000001) in a Python program, the decimal constant 1.000000000000001 will be rounded to the nearest binary64 floating-point number, and so you will actually evaluate log(fl(1.000000000000001)) = log(1.0000000000000011102230246251565404236316680908203125) โ‰ˆ 1.110223024625156โ€‰ร—โ€‰10โˆ’15.

    Although 1.0000000000000011102230246251565404236316680908203125 is a good approximation to 1.000000000000001, with relative error ๐›ฟ < 10โˆ’15, log(1.0000000000000011102230246251565404236316680908203125) is a terrible approximation to log(1.000000000000001), with relative error ๐œ€ > 11%. This is not the fault of np.log, which did an admirable job of returning the correctly rounded result to the question we asked. This is because the mathematical function log is ill-conditioned near 1, so it magnified the tiny error 10โˆ’15 in the input we asked about from the input we wanted to ask aboutโ€”and not just magnified, but magnified by a quadrillionfold!

    So if you find yourself in possession of a small real number ๐‘ฅ, and you find yourself wanting to know what log(1 + ๐‘ฅ) is, then you should use np.log1p(x) to answer this question. (Or you may wish to rearrange a computation in terms of log(โ€ฆ) so that it uses log(1 + โ€ฆ) instead; e.g., to compute logit(๐‘) = log(๐‘/(1 โˆ’ ๐‘)) for a given ๐‘ near 1/2, you are better off rewriting it as log(1 + (1 โˆ’ 2๐‘)/๐‘).) If you wrote np.log(1 + x) instead of np.log1p(x), then the subexpression 1 + x may commit a rounding error, giving 1 โŠ• ๐‘ฅ = fl(1 + ๐‘ฅ) = (1 + ๐‘ฅ)โ‹…(1 + ๐›ฟ). Although the rounding error is small (in binary64 arithmetic, you are guaranteed that |๐›ฟ| โ‰ค 2โˆ’53), the log function may magnify it into an arbitrarily large error ๐œ€ in the output.

    But if you already have a number ๐‘ฆ, even if it is near zero, and find yourself wanting log(๐‘ฆ), then np.log(y) will give a good approximation to log(๐‘ฆ), and np.log1p(y) will give a terrible one (unless ๐‘ฆ is very large). This is the scenario you seem to find yourself in.

    Could np.log1p ever be relevant in plotting data on a log scale? Perhaps, if what you compute is ๐‘ฅ and what you wish to plot is 1 + ๐‘ฅ in log-scale. But it's unlikely that this combination of circumstancesโ€”computing ๐‘ฅ, and plotting 1 + ๐‘ฅ in log scaleโ€”makes sense together:

    • If you have a good reason for computing ๐‘ฅ as a proxy for 1 + ๐‘ฅ, most likely you are concerned primarily with values of ๐‘ฅ near zeroโ€”otherwise there's not much benefit to the representationโ€”and therefore most likely you are plotting values of 1 + ๐‘ฅ near 1.
    • But if you are plotting values of 1 + ๐‘ฅ near 1, then there is very little reason to use a log scale, because the closer your data points are to 1, the less difference there is between a log scale and a linear scale at all!

    log scale gnuplot

    set terminal pngcairo
    set output "logscale.png"
    set title 'log scale'
    set xrange [0:1]
    set logscale y
    plot x**(2 - 1) * (1 - x)**(5 - 1) notitle
    

    log1p scale gnuplot

    set terminal pngcairo                            
    set output "log1pscale.png"
    set title 'log1p scale'
    set xrange [0:1]
    set yrange [1:1.1]
    set logscale y 2
    set ytics 1.1**(1/4.0)
    plot 1 + x**(2 - 1) * (1 - x)**(5 - 1) notitle