
How to reduce a DataFrame to two DataFrames, then sum the rows of each into a new DataFrame


I have a dataframe:

sensor_id timestamp current apparent_power
0 2024-02-01 00:00:00.944369920 1.550 101.5
1 2024-02-01 00:00:00.959425024 0.284 20.4
0 2024-02-01 00:00:01.945563136 1.549 101.6
1 2024-02-01 00:00:01.950393856 0.286 20.2
0 2024-02-01 00:00:02.944880896 1.547 100.6
1 2024-02-01 00:00:02.949997056 0.290 21.6
0 2024-02-01 00:00:03.944605184 1.547 100.9
1 2024-02-01 00:00:03.949342976 0.290 21.8
0 2024-02-01 00:00:04.944967936 1.548 100.9
1 2024-02-01 00:00:04.950715136 0.285 20.5
0 2024-02-01 00:00:05.944571904 1.547 100.7
1 2024-02-01 00:00:05.952841984 0.284 20.2
0 2024-02-01 00:00:06.945287936 1.548 100.8
1 2024-02-01 00:00:06.950289920 0.287 20.8
0 2024-02-01 00:00:07.944841984 1.547 100.7
1 2024-02-01 00:00:07.962969088 0.290 21.8
0 2024-02-01 00:00:08.945434112 1.550 103.1
1 2024-02-01 00:00:08.950361088 0.281 20.1
0 2024-02-01 00:00:09.944396032 1.551 103.8
1 2024-02-01 00:00:09.949807872 0.285 21.1

My end aim is to push each sensor 1 reading onto the same line as its sensor 0 reading, summing the two rows to create a single total row per second:

timestamp_0 timestamp_1 total_current total_apparent_power
2024-02-01 00:00:00.944369920 2024-02-01 00:00:00.959425024 1.834 121.9
2024-02-01 00:00:01.945563136 2024-02-01 00:00:01.950393856 1.835 121.6
2024-02-01 00:00:02.944880896 2024-02-01 00:00:02.949997056 xxx xxx
2024-02-01 00:00:03.944605184 2024-02-01 00:00:03.949342976 xxx xxx
2024-02-01 00:00:04.944967936 2024-02-01 00:00:04.950715136 xxx xxx
2024-02-01 00:00:05.944571904 2024-02-01 00:00:05.952841984 xxx xxx
2024-02-01 00:00:06.945287936 2024-02-01 00:00:06.950289920 xxx xxx
2024-02-01 00:00:07.944841984 2024-02-01 00:00:07.962969088 xxx xxx
2024-02-01 00:00:08.945434112 2024-02-01 00:00:08.950361088 xxx xxx
2024-02-01 00:00:09.944396032 2024-02-01 00:00:09.949807872 xxx xxx

I have taken the timestamps and derived second and hour columns, producing:

sensor_id timestamp current apparent_power seconds hour
0 2024-02-01 00:00:00.944369920 1.550 101.5 0 0
1 2024-02-01 00:00:00.959425024 0.284 20.4 0 0
0 2024-02-01 00:00:01.945563136 1.549 101.6 1 0
1 2024-02-01 00:00:01.950393856 0.286 20.2 1 0
0 2024-02-01 00:00:02.944880896 1.547 100.6 2 0
1 2024-02-01 00:00:02.949997056 0.290 21.6 2 0
0 2024-02-01 00:00:03.944605184 1.547 100.9 3 0
1 2024-02-01 00:00:03.949342976 0.290 21.8 3 0
0 2024-02-01 00:00:04.944967936 1.548 100.9 4 0
1 2024-02-01 00:00:04.950715136 0.285 20.5 4 0
0 2024-02-01 00:00:05.944571904 1.547 100.7 5 0
1 2024-02-01 00:00:05.952841984 0.284 20.2 5 0
0 2024-02-01 00:00:06.945287936 1.548 100.8 6 0
1 2024-02-01 00:00:06.950289920 0.287 20.8 6 0
0 2024-02-01 00:00:07.944841984 1.547 100.7 7 0
1 2024-02-01 00:00:07.962969088 0.290 21.8 7 0
0 2024-02-01 00:00:08.945434112 1.550 103.1 8 0
1 2024-02-01 00:00:08.950361088 0.281 20.1 8 0
0 2024-02-01 00:00:09.944396032 1.551 103.8 9 0
1 2024-02-01 00:00:09.949807872 0.285 21.1 9 0
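
For reference, a minimal sketch of how those seconds/hour helper columns can be derived, assuming the timestamp column parses cleanly with pd.to_datetime:

import pandas as pd

ts = pd.to_datetime(test_file_df["timestamp"])
test_file_df["seconds"] = ts.dt.second  # second-of-minute, as in the table above
test_file_df["hour"] = ts.dt.hour       # hour-of-day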

I have something that starts to do it, but falls over if the data has a missing sensor reading:

import pandas as pd

# collects one combined row per second; it assumes each second contains
# one sensor 0 reading and one sensor 1 reading
power_sum_df = pd.DataFrame(columns=['timestamp_0', 'timestamp_1',
                                     'total_current', 'total_app_power'])
second_tracker = 0
a_row = 0
b_row = 0

for i, row in test_file_df.iterrows():

    if row['seconds'] != second_tracker:
        second_tracker += 1
        # store totals for the second that has just finished
        a_cur = a_row['current']
        b_cur = b_row['current']
        total_current = a_cur + b_cur

        a_app_power = a_row['apparent_power']
        b_app_power = b_row['apparent_power']
        total_app_power = a_app_power + b_app_power

        new_row = {'timestamp_0': a_row['timestamp'],
                   'timestamp_1': b_row['timestamp'],
                   'total_current': total_current,
                   'total_app_power': total_app_power}
        print(new_row)
        power_sum_df.loc[len(power_sum_df)] = new_row
        a_row = 0
        b_row = 0

    if row['sensor_id'] == 0:
        a_row = row
    else:
        b_row = row

It stops when there is no matching sensor row for a second, caused by a missing sensor reading - i.e. the current row is sensor 0 and the next row is also sensor 0 because the preceding sensor 1 reading was lost.

I have tried splitting into two DataFrames and summing row 1 + row 1, but could not work out how to add the two rows into a new DataFrame, summing only the two numeric columns while also carrying across the other row's timestamp.
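
Roughly, the split-and-sum attempt looks like this (it relies on the two sensors alternating perfectly, exactly one reading each per second, which is what breaks on a missing reading):

s0 = test_file_df[test_file_df["sensor_id"] == 0].reset_index(drop=True)
s1 = test_file_df[test_file_df["sensor_id"] == 1].reset_index(drop=True)

# positional alignment: row i of s0 is paired with row i of s1
paired = pd.DataFrame({
    "timestamp_0": s0["timestamp"],
    "timestamp_1": s1["timestamp"],
    "total_current": s0["current"] + s1["current"],
    "total_apparent_power": s0["apparent_power"] + s1["apparent_power"],
})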

Can anyone help - even a better way of pulling the alternate rows into one row? I tried to use unstack, groupby and agg to no avail.
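
Something along these lines is what I was aiming for with unstack - it widens the data to one row per second with one column per sensor, but only holds while there is at most one reading per sensor per second:

ts = pd.to_datetime(test_file_df["timestamp"])
wide = (test_file_df.assign(second=ts.dt.floor("s"))
        .set_index(["second", "sensor_id"])
        .unstack("sensor_id"))  # raises on duplicate (second, sensor_id) pairs

out = pd.DataFrame({
    "timestamp_0": wide[("timestamp", 0)],
    "timestamp_1": wide[("timestamp", 1)],
    "total_current": wide["current"].sum(axis=1),
    "total_apparent_power": wide["apparent_power"].sum(axis=1),
}).reset_index(drop=True)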

Update: the data shows irregular records within a single second:

line no sensor_id timestamp current apparent_power
4 0 2024-01-25 00:00:02.977937920 1.502 370.9
5 1 2024-01-25 00:00:02.983158016 0.293 72.4
6 0 2024-01-25 00:00:03.977388032 1.503 371.1
7 1 2024-01-25 00:00:03.995464960 0.284 70.3
8 0 2024-01-25 00:00:04.978688000 1.500 370.6
9 1 2024-01-25 00:00:05.025767168 0.287 70.9
10 0 2024-01-25 00:00:05.980681984 1.495 369.0
11 1 2024-01-25 00:00:06.031785984 0.286 70.7
12 0 2024-01-25 00:00:06.977396992 1.497 369.7
13 1 2024-01-25 00:00:06.984870912 0.280 69.3
14 0 2024-01-25 00:00:07.979064832 1.495 369.1
15 1 2024-01-25 00:00:07.983921920 0.282 69.6

Desired result: where there are more than 2 records in a single second, the logic should still take exactly one sensor 0 reading and one sensor 1 reading. If there is only a single sensor reading within the second, then use the corresponding sensor value from the previous second for the sum.

lines joined sensor_0 sensor_1 timestamp_0 timestamp_1 total_current total_apparent_power Comment
4 & 5 0 1 2024-01-25 00:00:02.977937920 2024-01-25 00:00:02.983158016 1.795 443.2 Normal
6 & 7 0 1 2024-01-25 00:00:03.977388032 2024-01-25 00:00:03.995464960 1.787 441.4 Normal
8 & 7 0 1 2024-01-25 00:00:04.978688000 use line 7 sensor 1 as no sensor 1 in this second 1.784 440.9 use previous sensor 1
10 & 9 0 1 2024-01-25 00:00:05.980681984 2024-01-25 00:00:05.025767168 1.782 439.9 swap so ts_0 is sensor_0
12 & 13 0 1 2024-01-25 00:00:06.977396992 2024-01-25 00:00:06.984870912 1.777 439.0 use last sens0, and last sens1 of second
14 & 15 0 1 2024-01-25 00:00:07.979064832 2024-01-25 00:00:07.983921920 1.777 438.7 Normal

I have also modified the groupby to use hour, minute & second to get the grouping I am after.

power_agg_df = (test_file_df
                .groupby([ts.dt.hour, ts.dt.minute, ts.dt.second], sort=False)
                .agg(**{
                    "sensor_0": ("sensor_id", "first"),
                    "sensor_1": ("sensor_id", "last"),
                    "timestamp_0": ("timestamp", "first"),
                    "timestamp_1": ("timestamp", "last"),
                    "total_current": ("current", "sum"),
                    "total_apparent_power": ("apparent_power", "sum"),
                }))

Solution

  • Instead of a loop, you can use the (hour, second) combination as a grouper to agg the operations you need:

    ts = pd.to_datetime(df["timestamp"])
    grp = {"hour": ts.dt.hour, "second": ts.dt.second} # +more ?
    # grp = {"time": ts.dt.floor("s").dt.time} # 3-components of time
    
    #1 reduce the groups where a sensor has more than one reading, keeping the last row per sensor
    uniq = (df.assign(**grp).sort_values([*grp, "sensor_id"]).loc[
            lambda x: ~x.duplicated(subset=[*grp, "sensor_id"], keep="last")])
    
    #2 use the previous sensor readings to complete the groups that only have a single one
    miss = df.groupby(list(grp.values()))["sensor_id"].transform("nunique").eq(1)
    exte = (uniq.mask(miss | uniq["sensor_id"].eq(uniq["sensor_id"].shift(-1)))
               .ffill().loc[miss].convert_dtypes())#.assign(timestamp=None))
    
    #3 swap the sensors in the groups where sensor 1 comes first
    cln = (pd.concat([uniq, exte if not exte.empty else None])
             .sort_values(["sensor_id", *grp]))
    out = (cln.astype({"line no": "str"})
              .groupby(list(grp.values()))
              .agg(**{
                  # feel free to comment (#) any agg
                  "lines_joined": ("line no", " & ".join),
                  "sensor_0": ("sensor_id", "first"),
                  "sensor_1": ("sensor_id", "last"),
                  "timestamp_0": ("timestamp", "first"),
                  "timestamp_1": ("timestamp", "last"),
                  "total_current": ("current", "sum"),
                  "total_apparent_power": ("apparent_power", "sum"),
              })
              .reset_index(names=[*grp])  # to expose the grouping keys as columns
              # .reset_index(drop=True)   # or just drop them if they are optional
          )
    

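  • As a side note: on the first, regular dataset (exactly one sensor 0 reading followed by one sensor 1 reading each second), the clean-up steps #1-#3 are not needed and a plain groupby/agg should be enough - a minimal sketch, assuming that ordering holds within every second:

    ts = pd.to_datetime(df["timestamp"])
    out = (df.groupby([ts.dt.hour, ts.dt.second], sort=False)
             .agg(**{
                 "timestamp_0": ("timestamp", "first"),
                 "timestamp_1": ("timestamp", "last"),
                 "total_current": ("current", "sum"),
                 "total_apparent_power": ("apparent_power", "sum"),
             })
             .reset_index(drop=True))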