Search code examples
pythonpandasnumpynumbabodo

Speeding up group-wise differencing in Pandas


Consider the following solution to computing a within-group diff in Pandas:

df =  df.set_index(['ticker', 'date']).sort_index()[['value']]
df['diff'] = np.nan
idx = pd.IndexSlice

for ix in df.index.levels[0]:
    df.loc[ idx[ix,:], 'diff'] = df.loc[idx[ix,:], 'value' ].diff()

For:

> df
   date ticker  value
0    63      C   1.65
1    88      C  -1.93
2    22      C  -1.29
3    76      A  -0.79
4    72      B  -1.24
5    34      A  -0.23
6    92      B   2.43
7    22      A   0.55
8    32      A  -2.50
9    59      B  -1.01

It returns:

> df
             value  diff
ticker date             
A      22     0.55   NaN
       32    -2.50 -3.05
       34    -0.23  2.27
       76    -0.79 -0.56
B      59    -1.01   NaN
       72    -1.24 -0.23
       92     2.43  3.67
C      22    -1.29   NaN
       63     1.65  2.94
       88    -1.93 -3.58

The solution does not scale well for large dataframes. It takes minutes for a dataframe with a shape (405344,2). This is presumably the case because I am iterating through each value for the first level in the main loop.

Is there any way of speeding this up in Pandas? Is looping through index values a good way of solving this problem? Could numba perhaps be used for this?


Solution

  • Here's another way, which ought to be a lot faster.

    First, sort based on ticker and date:

    In [11]: df = df.set_index(['ticker', 'date']).sort_index()
    
    In [12]: df
    Out[12]:
                 value
    ticker date
    A      22     0.55
           32    -2.50
           34    -0.23
           76    -0.79
    B      59    -1.01
           72    -1.24
           92     2.43
    C      22    -1.29
           63     1.65
           88    -1.93
    

    Add the diff column:

    In [13]: df['diff'] = df['value'].diff()
    

    To fill in the NaNs, we can find the first line as follows (there may be a nicer way):

    In [14]: s = pd.Series(df.index.labels[0])
    
    In [15]: s != s.shift()
    Out[15]:
    0     True
    1    False
    2    False
    3    False
    4     True
    5    False
    6    False
    7     True
    8    False
    9    False
    dtype: bool
    
    In [16]: df.loc[(s != s.shift()).values 'diff'] = np.nan
    
    In [17]: df
    Out[17]:
                 value  diff
    ticker date
    A      22     0.55   NaN
           32    -2.50 -3.05
           34    -0.23  2.27
           76    -0.79 -0.56
    B      59    -1.01   NaN
           72    -1.24 -0.23
           92     2.43  3.67
    C      22    -1.29   NaN
           63     1.65  2.94
           88    -1.93 -3.58