convert a pandas loc operation that needed the index to assign values to polars

In this example i have three columns, the 'DayOfWeek' Time' and the 'Risk'.

I want to group by 'DayOfWeek' and take the first element only and assign a high risk on it. This means the first known hour in day of week is the one that has the highest risk. The rest is initialized to 'Low' risk.

In pandas i had an additional column for the index, but in polars i do not. I could artificially create one, but is it even necessary?

Can i do this somehow smarter with polars?

df['risk'] = "Low"

df = df.sort('Time')
df.loc[df.groupby("DayOfWeek").head(1).index, "risk"] = "High"

The index is unique in this case and goes to range(n)

Here is my solution btw. (I don't really like it)

df = df.with_columns(pl.int_range(df.shape[0]).alias('pseudo_index')

# find lowest time for day
indexes_df = df.sort('Time').group_by('DayOfWeek').head(1)
# Set 'High' as col for all rows from group_by

indexes_df = indexes_df.select('pseudo_index').with_columns(pl.lit('High').alias('risk'))

# Left join will generate null values for all values that are not in indexes_df 'pseudo_index'
df = df.join(indexes_df, how='left', on='pseudo_index').select(
pl.exclude('pseudo_index', 'risk'), pl.col('risk').fill_null(pl.lit('low'))
)

Solution

You can use window functions to find where the first "index" of the "DayOfWeek" group equals the"index" column.

For that we only need to set an "index" column. We can do that easily with:

A method: df.with_row_index(<name>)
An expression: pl.int_range(pl.len()).alias(<name>)

After that we can use this predicate:

pl.first("index").over("DayOfWeek") == pl.col("index")

Finally we use a when -> then -> otherwise expression to use that condition and create our new "Risk" column.

Example

Let's start with some data. In the snippet below I create an hourly date range and then determine the weekdays from that.

Preparing data

df = pl.DataFrame({
    "Time": pl.datetime_range(pl.datetime(2022, 6, 1), pl.datetime(2022, 6, 30), "1h", eager=True).sample(fraction=1.5, with_replacement=True).sort(),
}).select(
    pl.int_range(pl.len()).alias("index"),
    pl.all(),
    pl.col("Time").dt.weekday().alias("DayOfWeek"),
)

print(df)

shape: (1_045, 3)
┌───────┬─────────────────────┬───────────┐
│ index ┆ Time                ┆ DayOfWeek │
│ ---   ┆ ---                 ┆ ---       │
│ i64   ┆ datetime[μs]        ┆ i8        │
╞═══════╪═════════════════════╪═══════════╡
│ 0     ┆ 2022-06-01 01:00:00 ┆ 3         │
│ 1     ┆ 2022-06-01 01:00:00 ┆ 3         │
│ 2     ┆ 2022-06-01 02:00:00 ┆ 3         │
│ 3     ┆ 2022-06-01 02:00:00 ┆ 3         │
│ 4     ┆ 2022-06-01 02:00:00 ┆ 3         │
│ …     ┆ …                   ┆ …         │
│ 1040  ┆ 2022-06-29 23:00:00 ┆ 3         │
│ 1041  ┆ 2022-06-29 23:00:00 ┆ 3         │
│ 1042  ┆ 2022-06-29 23:00:00 ┆ 3         │
│ 1043  ┆ 2022-06-30 00:00:00 ┆ 4         │
│ 1044  ┆ 2022-06-30 00:00:00 ┆ 4         │
└───────┴─────────────────────┴───────────┘

Computing Risk values

df.with_columns(
    pl.when(
        pl.first("index").over("DayOfWeek") == pl.col("index")
    ).then(
        pl.lit("High")
    ).otherwise(
        pl.lit("Low")
    ).alias("Risk")
).drop("index")

shape: (1_045, 3)
┌─────────────────────┬───────────┬──────┐
│ Time                ┆ DayOfWeek ┆ Risk │
│ ---                 ┆ ---       ┆ ---  │
│ datetime[μs]        ┆ i8        ┆ str  │
╞═════════════════════╪═══════════╪══════╡
│ 2022-06-01 01:00:00 ┆ 3         ┆ High │
│ 2022-06-01 01:00:00 ┆ 3         ┆ Low  │
│ 2022-06-01 02:00:00 ┆ 3         ┆ Low  │
│ 2022-06-01 02:00:00 ┆ 3         ┆ Low  │
│ 2022-06-01 02:00:00 ┆ 3         ┆ Low  │
│ …                   ┆ …         ┆ …    │
│ 2022-06-29 23:00:00 ┆ 3         ┆ Low  │
│ 2022-06-29 23:00:00 ┆ 3         ┆ Low  │
│ 2022-06-29 23:00:00 ┆ 3         ┆ Low  │
│ 2022-06-30 00:00:00 ┆ 4         ┆ Low  │
│ 2022-06-30 00:00:00 ┆ 4         ┆ Low  │
└─────────────────────┴───────────┴──────┘