I have a list of data called data. This list contains 21073 nested numpy arrays. Each numpy array, in turn, contains 512 values. For example, one nested numpy array looks as follows:
[534.42623424, 942.2323, 123.73434 ...]
Moreover, I have another list of data called freq:
freq = [0.0009, 0.0053, 0.0034 ...]
which also contains 512 values.
Aim:
I use a for loop to compute a linear least-squares regression via scipy.stats.linregress (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html) between each nested numpy array in the list and freq, as follows:
for i in data:
    slope = sp.stats.linregress(x=np.log10(freq), y=np.log10(i))[0]
The code works. But there is one problem.
Problem: I assume that due to the massive size of the for loop (= 21073) I always get the warning RuntimeWarning: divide by zero encountered in log10. Importantly, the position in the for loop where this warning occurs varies every time I run the code again. That is, sometimes it occurs at n = 512, then again close to the end at n = 18766.
Question: Am I right that the problem is related to the massive size of the for loop? How can I fix the problem?
Your problem has nothing to do with the size of your data list. The warning is issued because at least one of your arrays contains one or more zeroes.
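To find out which arrays are responsible, you could scan the list for exact zeros before taking any logarithm. The following is a minimal sketch on a small, made-up stand-in for data (the values and the name zero_rows are illustrative assumptions, not taken from the question):

```python
import numpy as np

# Hypothetical stand-in for the real list: three short arrays,
# one of which contains an exact zero.
data = [
    np.array([534.4, 942.2, 123.7]),
    np.array([12.0, 0.0, 7.5]),
    np.array([3.1, 4.1, 5.9]),
]

# Collect the indices of the arrays that contain at least one zero.
zero_rows = [i for i, arr in enumerate(data) if np.any(arr == 0)]

print(zero_rows)  # [1]
```

The same check applied to freq (np.any(np.asarray(freq) == 0)) tells you whether the x values are affected as well.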
You can suppress the warning but you have to consider the impact this will have on your results.
For example:
import numpy as np
assert np.log10(0) == -np.inf
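One possible way to deal with this, if discarding the affected points is acceptable for your analysis, is to silence the warning only around the log10 calls with np.errstate and then drop every point whose logarithm is not finite. This is a sketch under that assumption; the toy values below are made up so that the remaining points lie exactly on a line with slope -1:

```python
import numpy as np
from scipy.stats import linregress

# Made-up example values; datum contains one zero.
freq = np.array([0.001, 0.01, 0.1, 1.0])
datum = np.array([100.0, 0.0, 1.0, 0.1])

# Silence "divide by zero" only inside this block, not globally.
with np.errstate(divide="ignore"):
    x = np.log10(freq)
    y = np.log10(datum)  # the zero becomes -inf

# Keep only the points where both logarithms are finite.
mask = np.isfinite(x) & np.isfinite(y)
slope = linregress(x[mask], y[mask]).slope

print(mask.sum(), slope)
```

Whether dropping points is appropriate depends on what a zero means in your data, so treat the mask as one option rather than the fix.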
The log10 operations are vectorised, so the warning does not align (as you might expect) with your loop index.
You should not calculate np.log10(freq) at every iteration of the loop because it wastes valuable time. It always returns the same result, so compute it once before the loop.
Consider this:
import numpy as np
from scipy.stats import linregress
from time import perf_counter

# Suppress all floating-point warnings, including divide by zero in log10
np.seterr(all="ignore")

N = 21_073
data = [np.random.rand(512) for _ in range(N)]
freq = np.random.rand(512)

# Check the raw values for zeros before taking the logarithm
if 0 in freq:
    print("freq contains at least one zero value. Results may not be as expected")

# Compute log10(freq) once, outside the loop
x = np.log10(freq)

start = perf_counter()
for datum in data:
    if 0 in datum:
        print("datum contains at least one zero value. Results may not be as expected")
    slope = linregress(x=x, y=np.log10(datum)).slope # type: ignore
print(f"Duration={perf_counter()-start:.4f}s")
Output:
Duration=0.6616s
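If only the slope is needed, the Python loop could be avoided altogether by stacking the arrays and fitting all of them in one call. The sketch below swaps in np.polyfit (which accepts a 2-D y) in place of linregress, so it returns slopes only and none of the other linregress statistics. The random data is a made-up, strictly positive stand-in, and the smaller N is chosen just to keep the example quick:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)

# Made-up, strictly positive stand-in data (no zeros, so no warning).
N = 100
freq = rng.uniform(0.001, 1.0, size=512)
data = [rng.uniform(1.0, 1000.0, size=512) for _ in range(N)]

x = np.log10(freq)
Y = np.log10(np.vstack(data))  # shape (N, 512)

# polyfit expects y with shape (512, N); row 0 of the result holds the slopes.
slopes = np.polyfit(x, Y.T, deg=1)[0]  # shape (N,)

# Cross-check the first slope against linregress.
reference = linregress(x, Y[0]).slope
print(np.isclose(slopes[0], reference))
```

Any zeros in the real data would still have to be handled before the fit, since a single -inf affects the least-squares solution.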