When transforming data to log scale for charting purposes, is it more "correct" in some way to always transform using `np.log1p` rather than `np.log`, and does it break any common user expectations? I'm building charting software with log-scale capabilities, and I wonder whether I should use `np.log` or `np.log1p` as the default choice when transforming the data.
Here's a vastly simplified code sample:

```python
import numpy as np
import matplotlib.pyplot as plt

def chart_with_log_scale(x, y):
    ylog = np.log(y)  # should I be using np.log1p here instead?
    plt.scatter(x, ylog)
```
Or, to put a different perspective on this: does matplotlib use `log1p` or `log` when it does its log transform in code such as this?

```python
def chart_with_log_scale2(x, y):
    plt.scatter(x, y)
    ax = plt.gca()
    ax.set_yscale("log")
```
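For what it's worth, here is how I tried to probe what matplotlib does (inspecting the axis transform this way is my own assumption about its API); the result looks like an ordinary base-10 logarithm rather than log1p:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_yscale("log")

# Inspect the transform matplotlib applies to y data: a base-10 log
# scale sends 1 -> 0, 10 -> 1, 100 -> 2, whereas a log1p-style
# transform would send 1 to log(2), not 0.
tr = ax.yaxis.get_transform()
print(tr.transform(np.array([1.0, 10.0, 100.0])))
```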
> When transforming data to log scale for charting purposes, is it more "correct" in some way to always transform using `np.log1p` rather than `np.log`, and does it break any common user expectations?
It is almost never correct to use `np.log1p` instead of `np.log` if your goal is to compute log(x).
Here's an example of a plot with the y axis in log scale, of the probability density function for the Beta distribution with α = 2 and β = 5:
Here's the same function with the y axis in log1p scale instead:
If I tried to pass this off as a log scale plot of the Beta(2,5) PDF as a grad student, my advisor would probably shoot me dead on the spot.
(Exception: if your inputs are always greater than 2^53 on a machine with IEEE 754 binary64 arithmetic, then the two functions will most likely coincide. But that is only because log(1 + x) has such low relative error from log(x) on such inputs; that is, |log(1 + x) − log(x)|/|log(x)| = |log(x·(1/x + 1)) − log(x)|/log(x) = log(1 + 1/x)/log(x) < 1/x < 2^-53, so log(1 + x) is at worst a rounding error away from log(x).)
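A minimal sketch of that exception (the particular value of x is my own choice):

```python
import numpy as np

# For x > 2**53, the sum 1 + x rounds to x in binary64 arithmetic, so
# log(x) and log(1 + x) agree to within a rounding error.
x = 2.0 ** 60
print(np.log(x))
print(np.log1p(x))
```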
In a comment, you asked:
> log1p could be what I want if the values are very close to 0, as it will have better numerical stability than log, right?
The functions log1p and log are just mathematical functions. Neither one has “better numerical stability” than the other: “numerical stability” is not even a well-defined concept, and certainly not a property of a mathematical function. An algorithm for computing a mathematical function can exhibit forward or backward stability; what this property means is relative to the function it aims to compute. But log and log1p are simply mathematical functions, not algorithms for computing functions, and as such, forward and backward stability do not apply.
The importance of log1p is that the function log(1 + x) is well-conditioned near zero, and often turns up in numerical algorithms or algebraic rearrangements of other functions. Well-conditioned means that if you evaluate it at a point x·(1 + δ) when you actually wanted to evaluate it at x, then the answer log(1 + x·(1 + δ)) is equal to log(1 + x)·(1 + ε), where ε is a reasonably small multiple of δ as long as δ is reasonably small. Here δ is the relative error of the input x·(1 + δ) from x, and ε is the relative error of the output log(1 + x)·(1 + ε) from log(1 + x).
In contrast, the function log(y) is ill-conditioned near 1: evaluate log(y·(1 + δ)) when you want log(y) for some point y near 1, and what you get back may be log(y)·(1 + ε) for an arbitrarily bad error ε, even if the input error δ was quite small.
For example, suppose you want to evaluate log(1.000000000000001) ≈ 9.999999999999995 × 10^-16.
If you write `np.log(1.000000000000001)` in a Python program, the decimal constant 1.000000000000001 will be rounded to the nearest binary64 floating-point number, and so you will actually evaluate log(fl(1.000000000000001)) = log(1.0000000000000011102230246251565404236316680908203125) ≈ 1.110223024625156 × 10^-15.
Although 1.0000000000000011102230246251565404236316680908203125 is a good approximation to 1.000000000000001, with relative error δ < 10^-15, log(1.0000000000000011102230246251565404236316680908203125) is a terrible approximation to log(1.000000000000001), with relative error ε > 11%.
This is not the fault of `np.log`, which did an admirable job of returning the correctly rounded result to the question we asked. This is because the mathematical function log is ill-conditioned near 1, so it magnified the tiny error 10^-15 in the input we asked about from the input we wanted to ask about; and not just magnified, but magnified by a quadrillionfold!
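This is easy to reproduce (a minimal sketch using the constants from the example above):

```python
import numpy as np

# What we wanted: log(1.000000000000001) ~ 9.999999999999995e-16.
# log1p computes this from the input 1e-15, whose own rounding error
# is harmless because log(1 + x) is well-conditioned near zero.
print(np.log1p(1e-15))

# What we get instead: the literal 1.000000000000001 is rounded to the
# nearest binary64 float before np.log ever sees it, and log's
# ill-conditioning near 1 turns that tiny input error into an ~11%
# output error.
print(np.log(1.000000000000001))
```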
So if you find yourself in possession of a small real number x, and you find yourself wanting to know what log(1 + x) is, then you should use `np.log1p(x)` to answer this question.
(Or you may wish to rearrange a computation in terms of log(…) so that it uses log(1 + …) instead; e.g., to compute logit(p) = log(p/(1 − p)) for a given p near 1/2, you are better off rewriting it as −log(1 + (1 − 2p)/p).)
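A sketch of that logit rearrangement (the helper name is mine, and note that log(1 + (1 − 2p)/p) = log((1 − p)/p) is the *negative* of logit(p), hence the leading minus sign):

```python
import numpy as np

def logit(p):
    # Rewriting log(p/(1 - p)) as -log1p((1 - 2p)/p) keeps the argument
    # of log1p near 0 when p is near 1/2, instead of feeding log a
    # quotient near 1, where log is ill-conditioned.
    return -np.log1p((1.0 - 2.0 * p) / p)

print(logit(0.75))                  # should equal log(3)
print(logit(0.5000000001))          # p near 1/2
print(np.log(0.5000000001 / (1.0 - 0.5000000001)))  # naive form
```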
If you wrote `np.log(1 + x)` instead of `np.log1p(x)`, then the subexpression `1 + x` may commit a rounding error, giving 1 ⊕ x = fl(1 + x) = (1 + x)·(1 + δ).
Although the rounding error is small (in binary64 arithmetic, you are guaranteed that |δ| ≤ 2^-53), the log function may magnify it into an arbitrarily large error ε in the output.
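Concretely (a minimal sketch; the value of x is my own choice):

```python
import numpy as np

x = 1e-20
# 1 + x rounds to exactly 1.0 in binary64, so all information about x
# is destroyed before log ever runs.
print(np.log(1 + x))  # 0.0
print(np.log1p(x))    # ~1e-20, correct to full precision
```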
But if you already have a number y, even if it is near zero, and find yourself wanting log(y), then `np.log(y)` will give a good approximation to log(y), and `np.log1p(y)` will give a terrible one (unless y is very large). This is the scenario you seem to find yourself in.
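To see how bad `np.log1p(y)` is as a stand-in for log(y) near zero (a minimal sketch):

```python
import numpy as np

y = 1e-3
print(np.log(y))    # ~ -6.9, the value a log-scale plot needs
print(np.log1p(y))  # ~ 0.0009995, nowhere near log(y)
```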
Could `np.log1p` ever be relevant in plotting data on a log scale? Perhaps, if what you compute is x and what you wish to plot is 1 + x in log scale. But it's unlikely that this combination of circumstances (computing x, and plotting 1 + x in log scale) makes sense together:
```gnuplot
set terminal pngcairo
set output "logscale.png"
set title 'log scale'
set xrange [0:1]
set logscale y
plot x**(2 - 1) * (1 - x)**(5 - 1) notitle
```

```gnuplot
set terminal pngcairo
set output "log1pscale.png"
set title 'log1p scale'
set xrange [0:1]
set yrange [1:1.1]
set logscale y 2
set ytics 1.1**(1/4.0)
plot 1 + x**(2 - 1) * (1 - x)**(5 - 1) notitle
```
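(For reference, a rough matplotlib equivalent of the two gnuplot scripts above; the side-by-side layout and output filename are my own choices:)

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

x = np.linspace(0.0, 1.0, 500)
f = x ** (2 - 1) * (1 - x) ** (5 - 1)  # unnormalized Beta(2, 5) density

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: genuine log scale, as in logscale.png.
ax1.plot(x, f)
ax1.set_yscale("log")
ax1.set_title("log scale")

# Right: plot 1 + f on a base-2 log scale, mimicking log1pscale.png.
ax2.plot(x, 1 + f)
ax2.set_yscale("log", base=2)
ax2.set_ylim(1, 1.1)
ax2.set_title("log1p scale")

fig.savefig("logscale_vs_log1pscale.png")
```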