How to calculate time between rows for each ID in a Pandas DataFrame using Polars?


I have a Polars DataFrame (converted from a Pandas DataFrame) with the columns 'ID' and 'Timestamp'. Each row marks the end of a session, identified by its 'Timestamp'. I want to add a new column, 'time_between_sessions', holding the time elapsed since the previous session of the same 'ID'.

I have been able to compute the time difference between sessions for a specific filtered 'ID' using the following code:

import polars as pl

# DataFrame: sessions_features
# Columns: 'ID', 'Timestamp'

filtered_id = "BBIISSIOTNIFSIDYIUSA"
# Note: dt.seconds() was renamed to dt.total_seconds() in newer Polars releases
time_diff = sessions_features.filter(pl.col("ID") == filtered_id)["Timestamp"].diff().dt.total_seconds()

However, I'm struggling to perform this calculation for each 'ID' in the DataFrame using a group_by() operation or similar. I have attempted to use map_groups() but haven't been successful.

Could someone please guide me on how to perform this operation efficiently for each 'ID' using Polars? A minimal reproducible example would be this:

import polars as pl
import pandas as pd

# Creating a sample DataFrame
data = {
    'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Timestamp': ['2023-01-01 10:00:00', '2023-01-01 10:30:00', '2023-01-01 11:00:00', '2023-01-01 12:00:00', '2023-01-01 12:30:00', '2023-01-01 13:00:00']
}

df = pd.DataFrame(data)

# Converting to Polars DataFrame
sessions_features = pl.from_pandas(df)

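# Parse the string timestamps into a datetime column
sessions_features = sessions_features.with_columns(
    pl.col("Timestamp").str.to_datetime()
)

# Seconds between consecutive sessions, for a single ID only
print(sessions_features.filter(pl.col("ID") == "A")["Timestamp"].diff().dt.total_seconds())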

This example creates a sample DataFrame and calculates the time difference between sessions for a specific 'ID'. However, the goal is to perform this calculation for each unique 'ID' in the DataFrame efficiently using Polars. Any help or insights would be greatly appreciated!

The expected result for the final DataFrame in the minimal example would be:

┌─────┬─────────────────────┬───────────────────────┐
│ ID  ┆ Timestamp           ┆ time_between_sessions │
│ --- ┆ ---                 ┆ ---                   │
│ str ┆ datetime[μs]        ┆ i64                   │
╞═════╪═════════════════════╪═══════════════════════╡
│ A   ┆ 2023-01-01 10:00:00 ┆ 0                     │
│ A   ┆ 2023-01-01 10:30:00 ┆ 1800                  │
│ A   ┆ 2023-01-01 11:00:00 ┆ 1800                  │
│ B   ┆ 2023-01-01 12:00:00 ┆ 0                     │
│ B   ┆ 2023-01-01 12:30:00 ┆ 1800                  │
│ B   ┆ 2023-01-01 13:00:00 ┆ 1800                  │
└─────┴─────────────────────┴───────────────────────┘

Solution

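  • You can group by ID and then apply diff() to each group (maintain_order=True keeps the groups in their original order):

    sessions_features.group_by("ID", maintain_order=True).map_groups(
        lambda g: g.with_columns(
            pl.col("Timestamp").diff().dt.total_seconds().fill_null(0).alias("time_between_sessions")
        )
    )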