Search code examples
pythonpandasnumpygroup-byshift

Groupby shift (lagged values) analogue with only Numpy (no pandas)


I have a dataframe that looks like this:

          id    date       v1
0          0  1983.0    1.574
1          0  1984.0    1.806
2          0  1985.0    4.724
3          1  1986.0    0.320
4          1  1987.0    3.414
     ...     ...      ...
107191  9874  1993.0   52.448
107192  9874  1994.0  108.652
107193  9875  1992.0    1.597
107194  9875  1993.0    3.134
107195  9875  1994.0    7.619

I want to generate a new column that has lagged values of v1 ordered by id. In pandas I would use

df.groupby('id')['v1'].shift(-1)

However, I want to translate this in pure matrix/array form using only Numpy. What is the most straightforward way to get the analogue in Numpy? I need to avoid using pandas tools because I want to use the Numba @jit later on.


Solution

  • IIUC, you are looking to implement df.groupby('id')['v1'].shift(-1) purely in numpy. This is composed of a grouper and a shift method.

    A groupby() equivalent in numpy for 2D array with first grouping column and second value column is -

    np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:])
    

    And the shift() equivalent in numpy for a 1D array is -

    np.append(np.roll(arr,-1)[:-1], np.nan)
    

    Putting these 2 together you can get what you want -

    #2D array with only id and v1 as columns
    arr = df[['id','v1']].values   
    
    #Groupby based on id
    grouper = np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:]) 
    
    #apply shift to grouped elements
    shift = [np.append(np.roll(i,-1)[:-1], np.nan) for i in grouper] 
    
    #stack them as a single array
    new_col = np.hstack(shift) 
    
    #set as column
    df['shifted'] = new_col 
    

    Test on dummy data -

    #Dummy data
    idx = [0,0,0,0,0,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3]
    val = np.arange(len(idx))
    arr = np.array([idx, val]).T
    df = pd.DataFrame(arr, columns=['id','v1'])
    
    #apply grouped shifting
    arr = df[['id','v1']].values
    grouper = np.split(arr[:,1], np.unique(arr[:, 0], return_index=True)[1][1:])
    shift = [np.append(np.roll(i,-1)[:-1], np.nan) for i in grouper]
    new_col = np.hstack(shift)
    df['shifted'] = new_col
    
    print(df)
    
        id  v1  shifted
    0    0   0      1.0
    1    0   1      2.0
    2    0   2      3.0
    3    0   3      4.0
    4    0   4      NaN
    5    1   5      6.0
    6    1   6      7.0
    7    1   7      8.0
    8    1   8      9.0
    9    1   9     10.0
    10   1  10      NaN
    11   2  11     12.0
    12   2  12     13.0
    13   2  13     14.0
    14   2  14      NaN
    15   3  15     16.0
    16   3  16     17.0
    17   3  17     18.0
    18   3  18     19.0
    19   3  19      NaN