I have problems working with datetime
my dataset is like this:
import datetime
import polars as pl
df = pl.DataFrame({
'customer_unique_id': ['a', 'b', 'c', 'd', 'e'],
'order_purchase_timestamp': [[datetime.datetime(2018, 8, 13, 9, 14, 7),
datetime.datetime(2018, 7, 28, 0, 23, 49)],
[datetime.datetime(2017, 7, 27, 14, 13, 3),
datetime.datetime(2018, 1, 14, 7, 36, 54)],
[datetime.datetime(2017, 9, 1, 12, 11, 23),
datetime.datetime(2018, 5, 26, 19, 42, 48)],
[datetime.datetime(2018, 2, 19, 17, 11, 34),
datetime.datetime(2018, 3, 22, 18, 9, 41)],
[datetime.datetime(2018, 5, 23, 20, 14, 21),
datetime.datetime(2018, 6, 3, 10, 12, 57)]]
})
I wish to sort the dates inside the list on every row, and get the first and second datetime, because some rows has more than 2 values.
I'm trying with:
df.filter(
pl.col("order_purchase_timestamp").list.sort()
)
but I got this error:
ComputeError: filter predicate must be of type
Boolean
, gotlist[datetime[μs]]
You can use .with_columns()
context to select & modify columns.
col = pl.col("order_purchase_timestamp")
df.with_columns(
col.list.sort(descending=True).list.slice(0, 2)
) # / /
# sort by date take slice [0:2]
Solution that follows from the discussion in the comments
To perform some action under column based on condition, you can use pl.when -> then -> otherwise
construct. pl.when()
takes some boolean Series (condition). Read more here.
col = pl.col("order_purchase_timestamp")
# duration (difference) between 2 dates (in microseconds)
diff = pl.duration(microseconds=(col.list.get(0) - col.list.get(1)))
df.filter(col.list.len() >= 2).with_columns(
col.list.sort(descending=True).list.slice(0, 2)
).with_columns(
less_30 = pl.when(diff <= pl.duration(days=30))\
.then(True).otherwise(False)
)