Tags: python, numpy, pandas, vectorization, numerical-integration

Vectorize integration of pandas.DataFrame


I have a DataFrame of force-displacement data. The displacement array has been set to the DataFrame index, and the columns are my various force curves for different tests.

How do I calculate the work done (which is "the area under the curve")?
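To make the layout concrete, the data looks something like this (made-up numbers, just to illustrate the structure):

import pandas as pd

# illustrative only -- displacement as the index, one force column per test
forces = pd.DataFrame(
    {"test_1": [0.0, 1.2, 2.3, 3.1],
     "test_2": [0.0, 1.1, 2.6, 3.4]},
    index=pd.Index([0.0, 0.1, 0.2, 0.3], name="displacement"),
)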

I looked at numpy.trapz, which seems to do what I need, but I think I should be able to avoid looping over each column as I do here:

import numpy as np
import pandas as pd 

forces = pd.read_csv(...)
work_done = {}

for col in forces.columns:
    work_done[col] = np.trapz(forces.loc[col], forces.index))

I was hoping to create a new DataFrame of the areas under the curves rather than a dict, and thought that DataFrame.apply() or something similar might be appropriate, but I don't know where to start looking.

In short:

  1. Can I avoid the looping?
  2. Can I create a DataFrame of work done directly?

Thanks in advance for any help.


Solution

  • You could vectorize this by passing the whole DataFrame to np.trapz and specifying the axis= argument, e.g.:

    import numpy as np
    import pandas as pd
    
    # some random input data
    gen = np.random.RandomState(0)
    x = gen.randn(100, 10)
    names = [chr(97 + i) for i in range(10)]
    forces = pd.DataFrame(x, columns=names)
    
    # vectorized version
    wrk = np.trapz(forces, x=forces.index, axis=0)
    work_done = pd.DataFrame(wrk[None, :], columns=forces.columns)
    
    # non-vectorized version for comparison
    work_done2 = {}
    for col in forces.columns:
        work_done2[col] = np.trapz(forces.loc[:, col], forces.index)
    

    These give the following output:

    from pprint import pprint
    
    pprint(work_done.T)
    #            0
    # a -24.331560
    # b -10.347663
    # c   4.662212
    # d -12.536040
    # e -10.276861
    # f   3.406740
    # g  -3.712674
    # h  -9.508454
    # i  -1.044931
    # j  15.165782
    
    pprint(work_done2)
    # {'a': -24.331559643023006,
    #  'b': -10.347663159421426,
    #  'c': 4.6622123535050459,
    #  'd': -12.536039649161403,
    #  'e': -10.276861220217308,
    #  'f': 3.4067399176289994,
    #  'g': -3.7126739591045541,
    #  'h': -9.5084536839888187,
    #  'i': -1.0449311137294459,
    #  'j': 15.165781517623724}
    

    There are a couple of other problems with your original example. col is a column name rather than a row index, so it needs to index the second dimension of your dataframe (i.e. .loc[:, col] rather than .loc[col]). Also, you have an extra trailing parenthesis on the last line.
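    With those fixes in place, the loop and the vectorized call agree, which you can confirm with a quick check (a sketch reusing the names defined above):

    looped = pd.Series(work_done2)   # Series indexed by column name
    vectorized = work_done.iloc[0]   # first (and only) row of the one-row DataFrame
    assert np.allclose(vectorized[looped.index], looped)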


    Edit:

    You could also compute the result directly from the DataFrame by .applying np.trapz to each column (note that this gives a pandas Series indexed by column name rather than a DataFrame), e.g.:

    work_done = forces.apply(np.trapz, axis=0, args=(forces.index,))
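
    If you do want the same one-row DataFrame as in the example above, one option is to wrap the result afterwards (a sketch reusing the names defined above):

    # .apply with a scalar-returning function yields a Series;
    # to_frame().T turns it into a one-row DataFrame like work_done above
    work_done_series = forces.apply(np.trapz, axis=0, args=(forces.index,))
    work_done3 = work_done_series.to_frame().T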
    

    However, this isn't really 'proper' vectorization - you are still calling np.trapz separately on each column. You can see this by comparing the speed of the .apply version against calling np.trapz directly:

    In [1]: %timeit forces.apply(np.trapz, axis=0, args=(forces.index,))
    1000 loops, best of 3: 582 µs per loop
    
    In [2]: %timeit np.trapz(forces, x=forces.index, axis=0)
    The slowest run took 6.04 times longer than the fastest. This could mean that an
    intermediate result is being cached 
    10000 loops, best of 3: 53.4 µs per loop
    

    This isn't an entirely fair comparison, since the second version excludes the extra time taken to construct the output DataFrame from the numpy array, but that construction overhead should still be small compared with the difference in time spent on the actual integration.
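
    For a more like-for-like comparison you could time the DataFrame construction as well, e.g. (timings omitted here since they vary by machine):

    # vectorized integration plus wrapping the result back into a DataFrame
    %timeit pd.DataFrame(np.trapz(forces, x=forces.index, axis=0)[None, :],
                         columns=forces.columns)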