I have a pd.DataFrame that looks like this:
datetime ... month year seconds
0 2023-03-02 20:59:00 ... 3 2023 75540
1 2023-03-02 20:58:00 ... 3 2023 75480
2 2023-03-02 20:57:00 ... 3 2023 75420
3 2023-03-02 20:56:00 ... 3 2023 75360
4 2023-03-02 20:55:00 ... 3 2023 75300
.. ... ... ... ... ...
775 2023-03-01 14:34:00 ... 3 2023 52440
776 2023-03-01 14:33:00 ... 3 2023 52380
777 2023-03-01 14:32:00 ... 3 2023 52320
778 2023-03-01 14:31:00 ... 3 2023 52260
779 2023-03-01 14:30:00 ... 3 2023 52200
[780 rows x 11 columns]
The datetime object can only contain values that are between 14:30:00 and 20:59:00. If some values are outside of this range, they need to be deleted. Also, the gap between two rows should be exactly 60 seconds every time, except when changing from one day to another (when df['seconds']== 52200). Notice that the data can span over multiple days.
The first row of the df does not have to be 20:59:00, it could start at 16:30:00 for example. The same goes for the last row, meaning it is not always 14:30:00. Some rows are randomly missing, and I want to add them in.
The datetime, month, year, seconds columns should be the missing times, while the values in the other columns should be taking the value between the two rows surrounding it. Say for example we have :
datetime x month year seconds
299 2023-03-02 18:59:00 200 3 2023 68340
300 2023-03-02 18:56:00 230 3 2023 68160
I need to insert two rows (as the difference in seconds is 180, we have 2 missing minutes). The x column for each of these rows should be respectively 210 and 220. I tried doing this using a while loop, but it's extremely inefficient - as I sometimes have df with over 400'000 rows. Is there a way to do this using the pandas library, or anything else that would be more efficient than iterating over each pair of row, checking how many rows have to be included, then inserting and reseting the index?
Use:
#convert to datetime
df['datetime'] = pd.to_datetime(df['datetime'])
#create MultiIndex by dates and seconds columns
out = df.set_index([df['datetime'].dt.normalize(), 'seconds'])
#create full MultiIndex by all combinations of all seconds and all existing dates
mux = pd.MultiIndex.from_product([out.index.levels[0], range(52200, 75540 +60, 60)],
names=['date','seconds'])
#add missing rows
out = out.reindex(mux).reset_index()
#if necessary remove per dates rows before minimal seconds and after maximal seconds
mask = out['datetime'].notna()
out = out[mask.groupby(out['date']).cummax() &
mask.iloc[::-1].groupby(out['date'].iloc[::-1]).cummax().iloc[::-1]]
#interpolate x column
out['x'] = out['x'].interpolate()
#extract monts, years, count datetimes from seconds
out['month'] = out['date'].dt.month
out['year'] = out['date'].dt.year
out['datetime'] = out['date'] + pd.to_timedelta(out['seconds'], 'seconds')