In this example i have three columns, the 'DayOfWeek' Time' and the 'Risk'.
I want to group by 'DayOfWeek' and take the first element only and assign a high risk on it. This means the first known hour in day of week is the one that has the highest risk. The rest is initialized to 'Low' risk.
In pandas i had an additional column for the index, but in polars i do not. I could artificially create one, but is it even necessary?
Can i do this somehow smarter with polars?
df['risk'] = "Low"
df = df.sort('Time')
df.loc[df.groupby("DayOfWeek").head(1).index, "risk"] = "High"
The index is unique in this case and goes to range(n)
Here is my solution btw. (I don't really like it)
df = df.with_columns(pl.int_range(df.shape[0]).alias('pseudo_index')
# find lowest time for day
indexes_df = df.sort('Time').group_by('DayOfWeek').head(1)
# Set 'High' as col for all rows from group_by
indexes_df = indexes_df.select('pseudo_index').with_columns(pl.lit('High').alias('risk'))
# Left join will generate null values for all values that are not in indexes_df 'pseudo_index'
df = df.join(indexes_df, how='left', on='pseudo_index').select(
pl.exclude('pseudo_index', 'risk'), pl.col('risk').fill_null(pl.lit('low'))
)
You can use window
functions to find where the first
"index"
of the "DayOfWeek"
group equals the"index"
column.
For that we only need to set an "index"
column. We can do that easily with:
df.with_row_index(<name>)
pl.int_range(pl.len()).alias(<name>)
After that we can use this predicate:
pl.first("index").over("DayOfWeek") == pl.col("index")
Finally we use a when -> then -> otherwise
expression to use that condition and create our new "Risk"
column.
Let's start with some data. In the snippet below I create an hourly date range and then determine the weekdays from that.
df = pl.DataFrame({
"Time": pl.datetime_range(pl.datetime(2022, 6, 1), pl.datetime(2022, 6, 30), "1h", eager=True).sample(fraction=1.5, with_replacement=True).sort(),
}).select(
pl.int_range(pl.len()).alias("index"),
pl.all(),
pl.col("Time").dt.weekday().alias("DayOfWeek"),
)
print(df)
shape: (1_045, 3)
┌───────┬─────────────────────┬───────────┐
│ index ┆ Time ┆ DayOfWeek │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ i8 │
╞═══════╪═════════════════════╪═══════════╡
│ 0 ┆ 2022-06-01 01:00:00 ┆ 3 │
│ 1 ┆ 2022-06-01 01:00:00 ┆ 3 │
│ 2 ┆ 2022-06-01 02:00:00 ┆ 3 │
│ 3 ┆ 2022-06-01 02:00:00 ┆ 3 │
│ 4 ┆ 2022-06-01 02:00:00 ┆ 3 │
│ … ┆ … ┆ … │
│ 1040 ┆ 2022-06-29 23:00:00 ┆ 3 │
│ 1041 ┆ 2022-06-29 23:00:00 ┆ 3 │
│ 1042 ┆ 2022-06-29 23:00:00 ┆ 3 │
│ 1043 ┆ 2022-06-30 00:00:00 ┆ 4 │
│ 1044 ┆ 2022-06-30 00:00:00 ┆ 4 │
└───────┴─────────────────────┴───────────┘
df.with_columns(
pl.when(
pl.first("index").over("DayOfWeek") == pl.col("index")
).then(
pl.lit("High")
).otherwise(
pl.lit("Low")
).alias("Risk")
).drop("index")
shape: (1_045, 3)
┌─────────────────────┬───────────┬──────┐
│ Time ┆ DayOfWeek ┆ Risk │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i8 ┆ str │
╞═════════════════════╪═══════════╪══════╡
│ 2022-06-01 01:00:00 ┆ 3 ┆ High │
│ 2022-06-01 01:00:00 ┆ 3 ┆ Low │
│ 2022-06-01 02:00:00 ┆ 3 ┆ Low │
│ 2022-06-01 02:00:00 ┆ 3 ┆ Low │
│ 2022-06-01 02:00:00 ┆ 3 ┆ Low │
│ … ┆ … ┆ … │
│ 2022-06-29 23:00:00 ┆ 3 ┆ Low │
│ 2022-06-29 23:00:00 ┆ 3 ┆ Low │
│ 2022-06-29 23:00:00 ┆ 3 ┆ Low │
│ 2022-06-30 00:00:00 ┆ 4 ┆ Low │
│ 2022-06-30 00:00:00 ┆ 4 ┆ Low │
└─────────────────────┴───────────┴──────┘