Tags: python, pandas, numpy, scipy, integral

Is there a more efficient way than a for loop with np.trapz() to generate an integral curve over a large pandas DataFrame?


I have a large pandas DataFrame of around 210,711 rows.

Currently I am using a for loop to generate the integral of a particular column, and Plotly to plot the data.

Here is the code I am using:

import pandas as pd
import numpy as np
import plotly as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots

DF = pd.read_csv(r"C:\Users\hsr4ban\Desktop\Temp\Temp.csv")

DF_01 = DF[["timestamps", "Data01", "Data02"]]

Integral_List = []

# On every iteration, re-integrate from the first row up to row `ind`
for ind in range(len(DF_01)):
    Integral = np.trapz(y = DF_01.loc[:ind, "Data02"], x = DF_01.loc[:ind, "timestamps"])
    Integral_List.append(Integral)

DF_01["Integral"] = Integral_List

Fig = make_subplots(rows = 2,
                    cols = 1,
                    shared_xaxes = True)

Fig.add_trace(go.Scatter(x = DF_01["timestamps"],
                         y = DF_01["Data02"],
                         name = "Data",
                         mode = "lines"),
              row = 1, col = 1)

Fig.add_trace(go.Scatter(x = DF_01["timestamps"],
                         y = DF_01["Integral"],
                         name = "Integral",
                         mode = "lines"),
              row = 2, col = 1)

Fig.show()

When the above code is run, I get the following graph as output: [image: two stacked subplots showing Data02 and its integral over the timestamps]

However, generating the "Integral" column with the for loop takes a lot of time. Is there a more efficient way to do this? Any suggestions would be helpful.


Solution

  • You can replace the loop with scipy.integrate.cumulative_trapezoid. The loop re-integrates from the first row on every iteration, so its cost grows quadratically with the number of rows, whereas cumulative_trapezoid computes the running integral in a single pass.

    from scipy.integrate import cumulative_trapezoid
    DF_01["Integral"] = cumulative_trapezoid(y=DF_01["Data02"], 
                                             x=DF_01["timestamps"],
                                             initial=0)