
Speeding up a loop calculation of an integral


I have input data stored in a dataframe of shape [16, 60000], with each column corresponding to a different time. I'm trying to calculate two integrals at every timestep with np.trapz, one against each of two different coordinate arrays.

I tried:

  • classical loop + append to list
  • classical loop + storing to an array
  • list comprehension

but I could not see any major improvement. How can I speed up this script?

Here is a minimal snippet:

import numpy as np
import pandas as pd    
import time

time_start = time.time()
# Read data
df_data = pd.DataFrame(np.random.randn(16, 60000))
x_values = np.array([0.  , 0.03, 0.1 , 0.2 , 0.3 , 0.4 , 0.5 , 0.6 , 0.7 , 0.8 , 0.9 ,
                     0.85, 0.7 , 0.5 , 0.2 , 0.05])
x_values2 = np.array([0.   ,  0.043,  0.083,  0.114,  0.13 ,  0.134,  0.124,  0.102,
                      0.078,  0.056,  0.03 , -0.006, -0.02 , -0.055, -0.069, -0.042])

# Get sample characteristics
Ns = df_data.shape[1]
times = range(Ns)

lt_data = [df_data.iloc[:,i] for i in times]
a = np.array([-np.trapz(y=data, x=x_values) for data in lt_data])
b = np.array([np.trapz(y=data, x=x_values2) for data in lt_data])

time_end = time.time()
elapsed = time_end - time_start
print(f'Elapsed: {elapsed:.1f}s')

Solution

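    1. np.trapz accepts a 2-D array and an axis argument, so it can be called once on the whole array instead of once per column.
    2. Work directly on the underlying NumPy array instead of using pandas iloc, which removes the overhead of a dataframe access at every iteration.

    Replace the loop with the following, after defining x_values and x_values2: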

    data_np = df_data.to_numpy()

    # axis=0 integrates down the 16 rows of every column in a single call
    a = -np.trapz(y=data_np, x=x_values, axis=0)  # minus sign kept from the original loop
    b = np.trapz(y=data_np, x=x_values2, axis=0)
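    A quick way to check that the single-call version matches the original loop is to compare the two on a smaller array. The sketch below is only an illustration: df_small, the 200-column size, the random seed and the trapezoid alias are assumptions made for this example, not part of the question. The alias is there because NumPy 2.0 introduced np.trapezoid as the replacement for np.trapz.

```python
import numpy as np
import pandas as pd

# Use np.trapezoid when available (NumPy >= 2.0), otherwise fall back to np.trapz.
trapezoid = getattr(np, "trapezoid", None) or np.trapz

# Hypothetical small test frame: 16 rows as in the question, but only 200 columns.
rng = np.random.default_rng(0)
df_small = pd.DataFrame(rng.standard_normal((16, 200)))

x_values = np.array([0., 0.03, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,
                     0.85, 0.7, 0.5, 0.2, 0.05])
x_values2 = np.array([0., 0.043, 0.083, 0.114, 0.13, 0.134, 0.124, 0.102,
                      0.078, 0.056, 0.03, -0.006, -0.02, -0.055, -0.069, -0.042])

data_np = df_small.to_numpy()
n_cols = data_np.shape[1]

# Reference: one call per column, as in the question.
a_loop = np.array([-trapezoid(y=data_np[:, i], x=x_values) for i in range(n_cols)])
b_loop = np.array([trapezoid(y=data_np[:, i], x=x_values2) for i in range(n_cols)])

# Vectorized: one call per integral, integrating along axis 0 (the 16 rows).
a_vec = -trapezoid(y=data_np, x=x_values, axis=0)
b_vec = trapezoid(y=data_np, x=x_values2, axis=0)

same_a = np.allclose(a_loop, a_vec)
same_b = np.allclose(b_loop, b_vec)
print(same_a, same_b)
```

    If both comparisons hold, the same two vectorized lines can be applied to the full 60000-column array.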