pandas dataframe group-by average

Taking average of Dataframe, Object values not printing


I have a DataFrame as follows:

Date Time U V
2023-01-03 10:00:21:08 5 8
2023-01-03 10:00:25:08 2 3
2023-01-03 10:00:27:08 3 3
2023-01-03 10:00:29:08 6 6
2023-01-03 10:00:31:08 4 6
2023-01-03 10:00:35:08 9 7
2023-01-03 10:00:37:08 10 4
2023-01-03 10:00:39:08 11 12

I am trying to average every 2 rows and print the result with the Date, Time, U and V columns.

I have done this:

data_columns=['Date','Time','U','V']
df3= data.groupby(np.arange(len(data))//2).mean()
print(df3)

However, my numeric values are averaged and printed, but my Date and Time are not. I know that's because the groupby mean only works on numeric values, and the Date and Time columns have object dtype. How do I get the Date and Time to print so that the initial time of each pair is shown next to the averaged values? Like so:

Date Time U V
2023-01-03 10:00:21:08 3.5 5.5 
2023-01-03 10:00:27:08 4.5 4.5
2023-01-03 10:00:31:08 6.5 6.5
2023-01-03 10:00:37:08 10.5 8

Note: in reality it is ...np.arange(len(data))//300).mean()


Solution

  • IIUC, groupby() takes a column name or a list of column names, for example:

    groupby('Date') or groupby(['Date','Time'])
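
    For the layout asked for in the question (first Date and Time of each pair, mean of U and V), one possible sketch is to keep the positional grouping key and pass a per-column mapping to agg(). The sample frame below is rebuilt from the rows in the question; the name block is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
data = pd.DataFrame({
    "Date": ["2023-01-03"] * 8,
    "Time": ["10:00:21:08", "10:00:25:08", "10:00:27:08", "10:00:29:08",
             "10:00:31:08", "10:00:35:08", "10:00:37:08", "10:00:39:08"],
    "U": [5, 2, 3, 6, 4, 9, 10, 11],
    "V": [8, 3, 3, 6, 6, 7, 4, 12],
})

# Give every block of 2 consecutive rows the same integer key: 0, 0, 1, 1, ...
block = np.arange(len(data)) // 2

# Keep the first Date/Time of each block and average the numeric columns.
df3 = data.groupby(block).agg(
    {"Date": "first", "Time": "first", "U": "mean", "V": "mean"}
)
print(df3)
```

    Changing the 2 to 300 should give the block size mentioned in the note at the end of the question.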

    UPDATE: After discussing with @doctorstrange, I would like to update the answer. The code below creates some dummy data:

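    import pandas as pd
    from datetime import datetime
    import time
    import random

    logfile = 'log.csv'
    c = 0

    # Open in 'w' mode so that re-running does not append a second header row.
    with open(logfile, 'w') as handle:
        handle.write('Date,Time,U,V\n')
        # Write 101 rows (c = 0..100), one per second.
        while c <= 100:
            now = datetime.now()
            D = now.strftime('%Y-%m-%d')
            # Colon before the fractional part, as in the question's Time column.
            T = now.strftime('%H:%M:%S:%f')
            u = random.randint(1, 20)
            v = random.randint(1, 20)
            handle.write(f'{D},{T},{u},{v}\n')
            c = c + 1
            time.sleep(1)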
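
    The loop above takes about 100 seconds because of time.sleep(1). If only test data is needed, a rough alternative is to synthesize the timestamps instead of waiting for the clock. This is a sketch, not the method used above: the start time, the seed and the name dummy are assumptions chosen for illustration, and the microsecond part is always zero here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n = 101                         # same row count as the loop (c = 0..100)

# One timestamp per second, starting at an arbitrary (assumed) moment.
stamps = pd.Timestamp("2023-01-13 11:31:43") + pd.to_timedelta(np.arange(n), unit="s")

dummy = pd.DataFrame({
    "Date": stamps.strftime("%Y-%m-%d"),
    "Time": stamps.strftime("%H:%M:%S:%f"),
    # integers(1, 21) draws from 1..20 inclusive, like random.randint(1, 20).
    "U": rng.integers(1, 21, size=n),
    "V": rng.integers(1, 21, size=n),
})
print(dummy.head())
```

    Calling dummy.to_csv(logfile, index=False) would then write a file with the same columns as the loop does.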
    

    Once loaded into a pandas DataFrame, the data looks like this:

    Date Time U V
    0 2023-01-13 11:31:43:481765 15 8
    1 2023-01-13 11:31:44:485698 13 11
    2 2023-01-13 11:31:45:489411 7 11
    3 2023-01-13 11:31:46:492592 6 11
    4 2023-01-13 11:31:47:496405 18 17
    ... ... ... ... ...
    96 2023-01-13 11:33:19:812832 10 5
    97 2023-01-13 11:33:20:812896 5 15
    98 2023-01-13 11:33:21:816767 16 9
    99 2023-01-13 11:33:22:818553 6 8
    100 2023-01-13 11:33:23:818651 13 6

    The data was written to the log file, 'log.csv'. Read it into a DataFrame named data:

    data=pd.read_csv(logfile)
    

    Create a new column, timestamp, by applying pd.Timestamp.combine(), which combines Date and Time into a single value:

    data['timestamp'] = data.apply(
        lambda x: pd.Timestamp.combine(
            datetime.strptime(x['Date'], '%Y-%m-%d'),
            datetime.strptime(x['Time'], '%H:%M:%S:%f').time(),
        ),
        axis=1,
    )
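
    As a possible alternative to the row-by-row apply(), the two string columns can be concatenated and parsed in one call to pd.to_datetime(). This is a minimal sketch on two rows taken from the sample output above; it assumes Date and Time are both read as strings.

```python
import pandas as pd

# Two rows taken from the dummy data shown above.
data = pd.DataFrame({
    "Date": ["2023-01-13", "2023-01-13"],
    "Time": ["11:31:43:481765", "11:31:44:485698"],
    "U": [15, 13],
    "V": [8, 11],
})

# Join Date and Time with a space and parse the whole column at once.
# The format keeps the colon before the microseconds, matching the log file.
data["timestamp"] = pd.to_datetime(
    data["Date"] + " " + data["Time"],
    format="%Y-%m-%d %H:%M:%S:%f",
)
print(data["timestamp"])
```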
    

    You will get this:

    Date Time U V timestamp
    0 2023-01-13 11:31:43:481765 15 8 2023-01-13 11:31:43.481765
    1 2023-01-13 11:31:44:485698 13 11 2023-01-13 11:31:44.485698
    2 2023-01-13 11:31:45:489411 7 11 2023-01-13 11:31:45.489411
    3 2023-01-13 11:31:46:492592 6 11 2023-01-13 11:31:46.492592
    4 2023-01-13 11:31:47:496405 18 17 2023-01-13 11:31:47.496405
    ... ... ... ... ... ...
    96 2023-01-13 11:33:19:812832 10 5 2023-01-13 11:33:19.812832
    97 2023-01-13 11:33:20:812896 5 15 2023-01-13 11:33:20.812896
    98 2023-01-13 11:33:21:816767 16 9 2023-01-13 11:33:21.816767
    99 2023-01-13 11:33:22:818553 6 8 2023-01-13 11:33:22.818553
    100 2023-01-13 11:33:23:818651 13 6 2023-01-13 11:33:23.818651

    Then, since you use a df3 variable, the code below calls resample() with rule='min' (one bin per minute) on the timestamp column, with offset='0s' (the bin edges are not shifted). It takes the mean() of U in each bin, then resets the index and names the resulting column U_mean:

    df3=data.resample(rule='min', on='timestamp', offset='0s').U.mean().reset_index(name='U_mean')
    df3
    

    The first result looks like this:

    timestamp U_mean
    0 2023-01-13 11:31:00 9.823529
    1 2023-01-13 11:32:00 10.783333
    2 2023-01-13 11:33:00 8.250000

    Next, do the same for the mean of V, stored in df4:

    df4=data.resample(rule='min', on='timestamp', offset='0s').V.mean().reset_index(name='V_mean')
    df4
    

    Then merge df3 and df4 on timestamp to get the final result:

    pd.merge(df3,df4, on='timestamp')
    

    The final result is:

    timestamp U_mean V_mean
    0 2023-01-13 11:31:00 9.823529 11.117647
    1 2023-01-13 11:32:00 10.783333 10.716667
    2 2023-01-13 11:33:00 8.250000 12.541667
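
    The two resample() calls and the merge can probably be collapsed into a single call by selecting both columns before taking the mean. The sketch below uses a small made-up frame (the timestamps are illustrative, not the logged ones), and the name result is an assumption.

```python
import pandas as pd

# Small illustrative frame; timestamps are made up for this sketch.
data = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-13 11:31:43.481765",
        "2023-01-13 11:31:44.485698",
        "2023-01-13 11:32:05.000000",
        "2023-01-13 11:32:06.000000",
    ]),
    "U": [15, 13, 7, 6],
    "V": [8, 11, 11, 11],
})

# One-minute bins on the timestamp column, mean of U and V together,
# then rename the columns to U_mean / V_mean and restore timestamp as a column.
result = (
    data.resample("min", on="timestamp")[["U", "V"]]
    .mean()
    .add_suffix("_mean")
    .reset_index()
)
print(result)
```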

    Hope this helps.