
Fill panel data with ranked timepoints in pandas


Given a DataFrame in which each row represents a call to a customer:

import pandas as pd
import numpy as np
df_1 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5]})

The data is ordered by time, so each customer forms a time series, and each customer has different timestamps. I therefore need a column that ranks the timepoints within each customer:

df_2 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5],
"call_nr" : [0,1,2,0,1,0,1,2,3,0,0,1]})

After trying different approaches I came up with this to create call_nr:

np.concatenate([np.arange(df_1["customer_id"].value_counts().loc[i]) for i in df_1["customer_id"].unique()])

It works, but I doubt this is best practice. Is there a better solution?


Solution

  • A simpler solution is to group by 'customer_id' and use cumcount:

    >>> df_1.groupby('customer_id').cumcount()
    
    0     0
    1     1
    2     2
    3     0
    4     1
    5     0
    6     1
    7     2
    8     3
    9     0
    10    0
    11    1
    

    which you can assign back as a column in your DataFrame.
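
    As a minimal sketch of that assignment, reusing df_1 and the column name call_nr from the question:

```python
import pandas as pd

df_1 = pd.DataFrame({"customer_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5]})

# cumcount numbers the rows of each group 0, 1, 2, ... in their current row order
df_1["call_nr"] = df_1.groupby("customer_id").cumcount()

print(df_1["call_nr"].tolist())
# [0, 1, 2, 0, 1, 0, 1, 2, 3, 0, 0, 1]
```

    Because cumcount follows the existing row order, this assumes the rows are already sorted by time within each customer, as the question states. Unlike the np.concatenate approach, it should also work when a customer's rows are not contiguous, since the result is aligned on the index.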