Search code examples
numpypandasdummy-data

Pandastic way of growing a dataframe


So, I have a year-indexed dataframe that I would like to increment by some logic beyond the end year (2013), say, grow the last value by n percent for 10 years, but the logic could also be to just add a constant, or slightly growing number. I will leave that to a function and just stuff the logic there.

I can't think of a neat vectorized way to do that with arbitrary length of time and logic, leaving a longer dataframe with the extra increments added, and would prefer not to loop it.


Solution

  • The particular calculation matters. In general you would have to compute the values in a loop. Some NumPy ufuncs (such as np.add, np.multiply, np.minimum, np.maximum) have an accumulate method, however, which may be useful depending on the calculation.

    For example, to calculate values given a constant growth rate, you could use np.multiply.accumulate (or cumprod):

    import numpy as np
    import pandas as pd
    N = 10
    index = pd.date_range(end='2013-12-31', periods=N, freq='D')
    df = pd.DataFrame({'val':np.arange(N)}, index=index)
    last = df['val'][-1]
    #             val
    # 2013-12-22    0
    # 2013-12-23    1
    # 2013-12-24    2
    # 2013-12-25    3
    # 2013-12-26    4
    # 2013-12-27    5
    # 2013-12-28    6
    # 2013-12-29    7
    # 2013-12-30    8
    # 2013-12-31    9
    
    # expand df
    index = pd.date_range(start='2014-1-1', periods=N, freq='D')
    df = df.reindex(df.index.union(index))
    
    # compute new values
    rate = 1.1
    df['val'][-N:] = last*np.multiply.accumulate(np.full(N, fill_value=rate))
    

    yields

                      val
    2013-12-22   0.000000
    2013-12-23   1.000000
    2013-12-24   2.000000
    2013-12-25   3.000000
    2013-12-26   4.000000
    2013-12-27   5.000000
    2013-12-28   6.000000
    2013-12-29   7.000000
    2013-12-30   8.000000
    2013-12-31   9.000000
    2014-01-01   9.900000
    2014-01-02  10.890000
    2014-01-03  11.979000
    2014-01-04  13.176900
    2014-01-05  14.494590
    2014-01-06  15.944049
    2014-01-07  17.538454
    2014-01-08  19.292299
    2014-01-09  21.221529
    2014-01-10  23.343682
    

    To increment by a constant value you could simply use np.arange:

    step=2
    df['val'][-N:] = np.arange(last+step, last+(N+1)*step, step)
    

    or cumsum:

    step=2
    df['val'][-N:] = last + np.full(N, fill_value=step).cumsum()
    

    Some linear recurrence relations can be expressed using scipy.signal.lfilter. See for example, Trying to vectorize iterative calculation with numpy and Recursive definitions in Pandas