Search code examples
python-polars

How to create a dummy column indicating the most recent record in a polars dataframe?


Task: find the most recent occurence of a user in the datatable having the information on date of service usage.

Sample data:

sample_pl = pl.DataFrame(
    {
        'id':['id1', 'id1', 'id1', 'id1',
              'id2', 'id2', 'id2', 'id2', 
              'id3', 'id3', 'id3', 'id3'],

        'dt_event': ['2023-06-01', '2023-06-04', '2023-06-10', '2023-06-28',
                     '2023-06-01', '2023-06-04', '2023-06-10', '2023-06-28',
                     '2023-06-01', '2023-06-04', '2023-06-10', '2023-06-28']
    }, 
    schema={
        'id': pl.Utf8, 'dt_event': pl.Utf8
    }
)

Solution

  • Manipulations to convert from str to date and the final answer:

    (
        sample_pl
            .with_columns(pl.col('dt_event').str.to_date())
            .with_columns(
                pl.when(pl.col('dt_event') == pl.col('dt_event').min().over('id'))
                .then(1)
                .otherwise(0)
                .alias('is_most_recent')
                )
     )