Search code examples
pythonpandasgrouping

Numbering occurrences within groups in python


I have a pandas dataframe with information about students and test dates. I would like to create a variable that takes on a new value for each student, but also takes on a new value for the same student if 5 years have passed without a test attempt. The desired column is "group" below. How can I do this in python?

Student test_date group 
Bob     1995      1
Bob     1997      1   
Bob     2020      2 
Bob     2020      2
Mary    2020      3
Mary    2021      3
Mary    2021      3

The initial, very clunky idea I had was to sort by name, sort by date, calculate the difference in date, have an ind if diff > 5, and then somehow number by groups.

    ds = pd.read_excel('../students.xlsx')
    ds = ds.sort_values(by=['student','test_date'])
    ds['time'] = ds['test_date'].diff() 
    ds['break'] = 0 
    ds.loc[(ds['time'] > 5),'break'] = 1

Student test_date time break
Bob     1995     na     na
Bob     1997     2      0
Bob     2020     23     1
Bob     2020     0      0
Mary    2020     na     na
Mary    2021     1      0
Mary    2021     0      0

Solution

  • df = df.sort_values(["Student", "test_date"])
    ((df.Student != df.Student.shift()) | (df.test_date.diff().gt(5))).cumsum()
    # 0    1
    # 1    1
    # 2    2
    # 3    2
    # 4    3
    # 5    3
    # 6    3
    # dtype: int32