Say i have a polars dataframe:
import polars as pl
df = pl.DataFrame({'index': [1,2,3,2,1],
'object': [1, 1, 1, 2, 2],
'period': [1, 2, 4, 4, 23],
'value': [24, 67, 89, 5, 23],
})
How would i get dict of index -> to last value?
df.col('value').last().over('index').alias('last')
is the last value but that requires a lot of extra computation and more work to get to the key value pairs.
The over
function will keep all rows, which is probably not what you want. An easy way to get just the last value
for index
is to use unique
.
(
df
.select("index", "value")
.unique(subset="index", keep="last")
)
shape: (3, 2)
┌───────┬───────┐
│ index ┆ value │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪═══════╡
│ 1 ┆ 23 │
│ 2 ┆ 5 │
│ 3 ┆ 89 │
└───────┴───────┘
From this point, you can use the to_dicts
method to convert the DataFrame to a list of dictionaries.
last_values = (
df
.select("index", "value")
.unique(subset="index", keep="last")
.to_dicts()
)
last_values
[{'index': 1, 'value': 23}, {'index': 2, 'value': 5}, {'index': 3, 'value': 89}]
If you are looking to later import this into a DataFrame, you'll want to stop at this point. For example:
pl.DataFrame(last_values)
>>> pl.DataFrame(last_values)
shape: (3, 2)
┌───────┬───────┐
│ index ┆ value │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪═══════╡
│ 1 ┆ 23 │
│ 2 ┆ 5 │
│ 3 ┆ 89 │
└───────┴───────┘
However, if you want to collapse this into a single dictionary of index
:value
pairs, you can use a dictionary comprehension.
{
next_dict["index"]: next_dict["value"]
for next_dict in last_values
}
{1: 23, 2: 5, 3: 89}
Let's assume that we have this data:
import polars as pl
import datetime
df = pl.DataFrame({
"index": [1, 2, 3],
"value": [10, 20, 30],
}).join(
pl.DataFrame({
'date': pl.date_range(datetime.date(2021, 1, 1), datetime.date(2023, 1, 1), "1y", eager=True)
}),
how="cross"
)
df
shape: (9, 3)
┌───────┬───────┬────────────┐
│ index ┆ value ┆ date │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ date │
╞═══════╪═══════╪════════════╡
│ 1 ┆ 10 ┆ 2021-01-01 │
│ 1 ┆ 10 ┆ 2022-01-01 │
│ 1 ┆ 10 ┆ 2023-01-01 │
│ 2 ┆ 20 ┆ 2021-01-01 │
│ 2 ┆ 20 ┆ 2022-01-01 │
│ 2 ┆ 20 ┆ 2023-01-01 │
│ 3 ┆ 30 ┆ 2021-01-01 │
│ 3 ┆ 30 ┆ 2022-01-01 │
│ 3 ┆ 30 ┆ 2023-01-01 │
└───────┴───────┴────────────┘
And we have these values that we want to update.
update_df = pl.DataFrame({
"index": [2, 3],
"value": [200, 300],
})
update_df
shape: (2, 2)
┌───────┬───────┐
│ index ┆ value │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪═══════╡
│ 2 ┆ 200 │
│ 3 ┆ 300 │
└───────┴───────┘
Note: I've purposely left out index
"1" (to show what will happen).
If we want to update the value
associated with each index
, but only beyond a certain date, we can use a join_asof
.
Since this is an advanced method, we'll take it in steps.
We'll add the current_date
to the update_df
as a literal. (The same value for all rows.)
We also need to make sure both our DataFrames are sorted by the "as_of" column (date
, not index
). (update_df
will already be sorted because its the same date on each row.)
I'll also sort after the join_asof
so that we can see what is happening more clearly. (You don't need to do this step.)
current_date = datetime.date(2022, 1, 1)
(
df
.sort("date")
.rename({"value": "prev_value"})
.join_asof(
update_df.with_columns(pl.lit(current_date).alias("date")),
on="date",
by=["index"],
strategy="backward"
)
.sort("index", "date")
)
shape: (9, 4)
┌───────┬────────────┬────────────┬───────┐
│ index ┆ prev_value ┆ date ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ date ┆ i64 │
╞═══════╪════════════╪════════════╪═══════╡
│ 1 ┆ 10 ┆ 2021-01-01 ┆ null │
│ 1 ┆ 10 ┆ 2022-01-01 ┆ null │
│ 1 ┆ 10 ┆ 2023-01-01 ┆ null │
│ 2 ┆ 20 ┆ 2021-01-01 ┆ null │
│ 2 ┆ 20 ┆ 2022-01-01 ┆ 200 │
│ 2 ┆ 20 ┆ 2023-01-01 ┆ 200 │
│ 3 ┆ 30 ┆ 2021-01-01 ┆ null │
│ 3 ┆ 30 ┆ 2022-01-01 ┆ 300 │
│ 3 ┆ 30 ┆ 2023-01-01 ┆ 300 │
└───────┴────────────┴────────────┴───────┘
Notice that only those rows with a date >= 2022-01-01 have a non-null value for value
. (I'll show how to do a > 2022-01-01 at the end.)
Next we'll use fill_null
to fill the null values in value
with the prev_value
column.
current_date = datetime.date(2022, 1, 1)
(
df
.sort("date")
.rename({"value": "prev_value"})
.join_asof(
update_df.with_columns(pl.lit(current_date).alias("date")),
on="date",
by=["index"],
strategy="backward"
)
.sort("index", "date")
.with_columns(pl.col("value").fill_null(pl.col("prev_value")))
)
shape: (9, 4)
┌───────┬────────────┬────────────┬───────┐
│ index ┆ prev_value ┆ date ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ date ┆ i64 │
╞═══════╪════════════╪════════════╪═══════╡
│ 1 ┆ 10 ┆ 2021-01-01 ┆ 10 │
│ 1 ┆ 10 ┆ 2022-01-01 ┆ 10 │
│ 1 ┆ 10 ┆ 2023-01-01 ┆ 10 │
│ 2 ┆ 20 ┆ 2021-01-01 ┆ 20 │
│ 2 ┆ 20 ┆ 2022-01-01 ┆ 200 │
│ 2 ┆ 20 ┆ 2023-01-01 ┆ 200 │
│ 3 ┆ 30 ┆ 2021-01-01 ┆ 30 │
│ 3 ┆ 30 ┆ 2022-01-01 ┆ 300 │
│ 3 ┆ 30 ┆ 2023-01-01 ┆ 300 │
└───────┴────────────┴────────────┴───────┘
Now, to clean up, we can drop the prev_value
column, and re-arrange the columns.
current_date = datetime.date(2022, 1, 1)
(
df
.sort("date")
.rename({"value": "prev_value"})
.join_asof(
update_df.with_columns(pl.lit(current_date).alias("date")),
on="date",
by=["index"],
strategy="backward"
)
.sort("index", "date")
.with_columns(pl.col("value").fill_null(pl.col("prev_value")))
.drop("prev_value")
.select(
pl.exclude("date"),
pl.col("date")
)
)
shape: (9, 3)
┌───────┬───────┬────────────┐
│ index ┆ value ┆ date │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ date │
╞═══════╪═══════╪════════════╡
│ 1 ┆ 10 ┆ 2021-01-01 │
│ 1 ┆ 10 ┆ 2022-01-01 │
│ 1 ┆ 10 ┆ 2023-01-01 │
│ 2 ┆ 20 ┆ 2021-01-01 │
│ 2 ┆ 200 ┆ 2022-01-01 │
│ 2 ┆ 200 ┆ 2023-01-01 │
│ 3 ┆ 30 ┆ 2021-01-01 │
│ 3 ┆ 300 ┆ 2022-01-01 │
│ 3 ┆ 300 ┆ 2023-01-01 │
└───────┴───────┴────────────┘
If you need to update only those rows that are strictly greater than current_date
, you can simply add one day to your current_date
. Polars makes this easy with the offset_by
expression.
(
df
.sort("date")
.rename({"value": "prev_value"})
.join_asof(
update_df.with_columns(pl.lit(current_date).dt.offset_by("1d").alias("date")),
on="date",
by=["index"],
strategy="backward"
)
.sort("index", "date")
.with_columns(pl.col("value").fill_null(pl.col("prev_value")))
.drop("prev_value")
.select(
pl.exclude("date"),
pl.col("date")
)
)
shape: (9, 3)
┌───────┬───────┬────────────┐
│ index ┆ value ┆ date │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ date │
╞═══════╪═══════╪════════════╡
│ 1 ┆ 10 ┆ 2021-01-01 │
│ 1 ┆ 10 ┆ 2022-01-01 │
│ 1 ┆ 10 ┆ 2023-01-01 │
│ 2 ┆ 20 ┆ 2021-01-01 │
│ 2 ┆ 20 ┆ 2022-01-01 │
│ 2 ┆ 200 ┆ 2023-01-01 │
│ 3 ┆ 30 ┆ 2021-01-01 │
│ 3 ┆ 30 ┆ 2022-01-01 │
│ 3 ┆ 300 ┆ 2023-01-01 │
└───────┴───────┴────────────┘