Search code examples
pythonpandasfunctioncountseries

Restarting count column when different row appears in Python/Pandas


I have the following Pandas DataFrame with multiple amounts for each address. The count of 'amounts' per address varies.

Index                                           type    amount
0   0xd81c0B4FEA284c908C5700187a67698b416a6bcc  outflow 2.553800e+04
1   0xd81c0B4FEA284c908C5700187a67698b416a6bcc  inflow  1.999650e+05
2   0xd81c0B4FEA284c908C5700187a67698b416a6bcc  inflow  3.896400e+04
3   0x0A5E7C50eA6BB695F2f4e75D97D3381592B59C9F  inflow  3.060000e+05
4   0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow 1.569367e+05
5   0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow 1.219739e+04

I want to create a column that would count from 1 to n rows per address, but I don't know how to restart the count at the next address.

Something that would look like this:

Index                                           type    amount          Epoch
0   0xd81c0B4FEA284c908C5700187a67698b416a6bcc  outflow 2.553800e+04    1
1   0xd81c0B4FEA284c908C5700187a67698b416a6bcc  inflow  1.999650e+05    2
2   0xd81c0B4FEA284c908C5700187a67698b416a6bcc  inflow  3.896400e+04    3
3   0x0A5E7C50eA6BB695F2f4e75D97D3381592B59C9F  inflow  3.060000e+05    1
4   0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow 1.569367e+05    1
5   0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow 1.219739e+04    2

As you can see, the count of epochs restarts when a row with a new address appears.

How can I create a logic for that column for any given number of addresses/rows?

Additionally: is there anything I should watch out for when structuring the DataFrame? E.G., always grouping equal addresses and not having them appear in random places in the DataFrame.


Solution

  • Use groupby with cumcount:

    1. If you want the count to continue if an address re-occurs later:
    df["Epoch"] = df.groupby("Index").cumcount()+1
    
    >>> df
                                            Index     type     amount  Epoch
    0  0xd81c0B4FEA284c908C5700187a67698b416a6bcc  outflow   25538.00      1
    1  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow  199965.00      2
    2  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow   38964.00      3
    3  0x0A5E7C50eA6BB695F2f4e75D97D3381592B59C9F   inflow  306000.00      1
    4  0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow  156936.70      1
    5  0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow   12197.39      2
    6  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow  199965.00      4
    7  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow   38964.00      5
    
    1. If you want the count to re-start at 1 for the address that re-occurs:
    df["Epoch"] = df.groupby((df["Index"]!=df["Index"].shift()).cumsum()).cumcount()+1
    
    >>> df
                                            Index     type     amount  Epoch
    0  0xd81c0B4FEA284c908C5700187a67698b416a6bcc  outflow   25538.00      1
    1  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow  199965.00      2
    2  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow   38964.00      3
    3  0x0A5E7C50eA6BB695F2f4e75D97D3381592B59C9F   inflow  306000.00      1
    4  0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow  156936.70      1
    5  0x2Eec494429E253938A10b2A9eCAD8ee7F603e4Af  outflow   12197.39      2
    6  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow  199965.00      1
    7  0xd81c0B4FEA284c908C5700187a67698b416a6bcc   inflow   38964.00      2
    

    Note the difference in output in the last two rows. I copied the second and third rows of your example to the end of the DataFrame to illustrate the difference in the two methods.