Search code examples
pythonpandasdataframetime-seriescomparison

Comparing the value of two consecutive rows of a dataframe for each level of a factor variable - Python Pandas


I have a pandas Dataframe containing traders' positions over time, that I created like this:

history = pd.read_csv(r"history.csv")
history = DataFrame(history, columns=['Symbol', 'Size', 'Entry Price',
                                            'Mark Price', 'PNL (ROE %)', 'Last Position Update'])
frames = [historylast, history]
history = pd.concat(frames)
positions = historylast['Symbol'].tolist()
historylast_symbol_set = set(positions)

where historylast is the last scraped database containing current positions, and history is the local copy with previous positions. This is the outcome:

history = history.sort_values('Symbol')
print (history)
      Symbol         Size  ...    PNL (ROE %)  Last Position Update
0    BNBUSDT      250.800  ...    7702.095588   2021-05-01 03:12:09
5    BNBUSDT     1000.800  ...   43351.359565   2021-04-29 03:51:41
0    BTCUSDT       54.422  ...  513277.155788   2021-04-25 21:03:13
0    BTCUSDT       54.422  ...  328896.563684   2021-04-25 21:03:13
1   DOGEUSDT  2600000.000  ...   46896.408000   2021-05-01 08:24:51

This dataframe has been created by putting toghether traders' positions over time. What I would like to do is to see if the last available 'Size' for each coin has changed with respect to the previous one. For example, for BNBUSDT, the last size is 250, reduced by 75% with respect to the previous size which was 1000. For BTCUSDT the Size has not changed since the last time. While for DOGEUSDT there are no previous data to compare, so it s still 100% of bought position.

To achieve this I though I should split the dataframe into different dataframes, one for each symbol, and compute and save the percentage change with a for loop, but I have difficulties, and wonder if there is not a better way. Any help would be much appreciated


Solution

  • Considering the following df as example (will use the column names Symbol and Size as well)

    import pandas as pd 
    
    d = {'Symbol': ["A", "C", "A", "B", "A", "B", "A"], 'Size': [1, 1, 2, 3, 4, 5, 4]}
    df = pd.DataFrame(data=d) 
    
    print(df) 
    >>>>  Symbol  Size
      0      A     1
      1      C     1
      2      A     2
      3      B     3
      4      A     4
      5      B     5
      6      A     4 
    

    To retrieve the last two rows for each Symbol, do the following

    g = df.groupby('Symbol').head(2)
    g = g.sort_values('Symbol').reset_index(drop=True)
    
    print(g)
    >>>   Symbol  Size
       0      A     1
       1      A     2
       2      B     3
       3      B     5
       4      C     1
    

    After that, in order to compute the difference between each Size for the respective group, assuming the values for that change are relevant, create a new column to display that difference with

    g['Difference'] = g.groupby('Symbol').diff()
    
    print(g)
    >>>   Symbol  Size  Difference
     0      A     1         NaN
     1      A     2         1.0
     2      B     3         NaN
     3      B     5         2.0
     4      C     1         NaN 
    

    Note that the first element appears a NaN as it wasn't changed.