I have a Dataframe "timeseries" which has datetimes as its index and I have a PeriodIndex "on":
import numpy as np
import pandas as pd
timeseries = pd.DataFrame(
index=pd.DatetimeIndex(
[
"2000-01-01 12:00:00Z",
"2000-01-01 13:00:00Z",
"2000-01-01 14:00:00Z",
"2000-01-02 13:00:00Z",
"2000-01-02 18:00:00Z",
"2000-01-03 14:00:00Z",
"2000-01-03 20:00:00Z",
"2000-01-04 13:00:00Z",
]
),
data={
"value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0],
},
)
on = pd.PeriodIndex(
["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)
I would like to add a column to "timeseries" that contains the period in "on" that each respective datetime is in:
value1 period
2000-01-01 12:00:00+00:00 6.0 2000-01-01
2000-01-01 13:00:00+00:00 5.0 2000-01-01
2000-01-01 14:00:00+00:00 3.0 2000-01-01
2000-01-02 13:00:00+00:00 7.0 2000-01-02
2000-01-02 18:00:00+00:00 4.0 2000-01-02
2000-01-03 14:00:00+00:00 4.0 NaN
2000-01-03 20:00:00+00:00 5.0 NaN
2000-01-04 13:00:00+00:00 3.0 2000-01-04
So far I have achieved this with a for-loop:
timeseries["period"] = np.NaN
for period in on:
datetimes_in_period = timeseries.index[
(timeseries.index >= period.start_time.tz_localize("UTC"))
& (timeseries.index <= period.end_time.tz_localize("UTC"))
]
timeseries["period"].loc[datetimes_in_period] = period
For efficiency's sake I want to avoid loops in Python. How can I vectorize this code?
You can still use .merge_asof
as @rorshan suggested.
If you create a dataframe of the start/end intervals:
df_on = pd.DataFrame({
"period": on,
"start_time": on.start_time.tz_localize("UTC"),
"end_time": on.end_time.tz_localize("UTC"),
})
df = pd.merge_asof(timeseries, df_on, left_index=True, right_on="start_time")
# blank out period when not inside
df["period"] = df["period"].where((df.index >= df["start_time"]) & (df.index <= df["end_time"]))
>>> df[["value1", "period"]]
value1 period
2000-01-01 12:00:00+00:00 6.0 2000-01-01
2000-01-01 13:00:00+00:00 5.0 2000-01-01
2000-01-01 14:00:00+00:00 3.0 2000-01-01
2000-01-02 13:00:00+00:00 7.0 2000-01-02
2000-01-02 18:00:00+00:00 4.0 2000-01-02
2000-01-03 14:00:00+00:00 4.0 NaT
2000-01-03 20:00:00+00:00 5.0 NaT
2000-01-04 13:00:00+00:00 3.0 2000-01-04