Tags: python, pandas, loops, dataframe, binning

Pandas df iteration, binning of data based on time in milliseconds


I have refocused my questions and tried to be as specific as possible. Below I also include the code I have used so far.

(1) When pulling data from SQL, my time is in a mixed format that contains a letter, which is hard to work with. To avoid issues with that, I tried to apply df.time=pd.to_timedelta(df.time, unit='ms'), which is fine, but I don't know how to extract the hours and minutes. Example: 2019.11.22D01:18:00.01000 — I just need column 'time' in the following format: '01:18:00.01000'. Maybe I can use np.datetime64 to convert all my SQL time entries to the desired format and truncate to the characters I need? Please advise, team. I also tried data=np.datetime64('time') but I am getting 'Error parsing datetime string "time" at position 0'.
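One possible sketch for this step, assuming the 'D' in the string is the only non-standard character (the sample strings below are made up to match the example format, here with six fractional digits):

```python
import pandas as pd

# Hypothetical strings in the mixed SQL format described above:
# date and time-of-day joined by a literal 'D'
raw = pd.Series(['2019.11.22D01:18:00.010000', '2019.11.22D01:18:00.090000'])

# Replace the 'D' with a space so pandas can parse the string,
# then keep only the time-of-day component
parsed = pd.to_datetime(raw.str.replace('D', ' '), format='%Y.%m.%d %H:%M:%S.%f')
time_of_day = parsed.dt.time  # datetime.time objects like 01:18:00.010000
```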

(2) I am attempting to group my data below by two factors: first by 'data2' and then by 'time'. This is because my data will not be in the order below but rather in a random order. I get: ''DataFrameGroupBy' object is not callable'. Is that because I have repeating 'data2' values? Could you please help with what is causing this?
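That error usually comes from treating (or calling) the groupby result as if it were a DataFrame; it is a DataFrameGroupBy that still needs an aggregation. A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up frame with repeating data2 values, as in the question
df = pd.DataFrame({'data2': [55, 55, 56],
                   'time':  [1, 2, 3],
                   'diff':  [0.04, 0.01, 0.02]})

g = df.groupby(['data2', 'time'])  # a DataFrameGroupBy, not a DataFrame
# g(...) would raise "'DataFrameGroupBy' object is not callable";
# instead, aggregate (or iterate over) the groups:
summed = g['diff'].sum().reset_index()
```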

(3) After I have grouped my data by 'data2' and 'time', I then need to bin the data within predefined time intervals (i.e. [0-10ms), [10-20ms), etc.), so rows 0, 1, 2 will fall under the [0-10ms) bin, for example. Thus I need to be able to define these bins first (I will have a fixed set of bins). Then, on the next 'data2' change (say from 55 to 56), we set the start time back to 0 and bin the row data based on the time elapsed from 0 until 'data2' changes again, and so on. How can I code this? Where I struggle the most is resetting the timer to 0 and referencing 'time' for every row as long as the 'data2' value hasn't changed, then, when 'data2' changes, starting over and binning the data accordingly.
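One way to sketch the reset-per-group logic: label each consecutive run of identical 'data2' values, measure elapsed time from the first row of that run, then cut into fixed half-open bins. The timestamps and the 1-second bin ceiling below are invented for illustration:

```python
import pandas as pd

# Hypothetical frame resembling the sample data in the question
df = pd.DataFrame({
    'time': pd.to_datetime(['2019-11-22 01:18:00.010',
                            '2019-11-22 01:18:00.090',
                            '2019-11-22 01:18:00.100',
                            '2019-11-22 01:18:00.140']),
    'data2': [55, 55, 55, 56],
})

# Label consecutive runs of the same data2 value, then compute
# milliseconds elapsed since the first row of each run (the "timer reset")
run = (df['data2'] != df['data2'].shift()).cumsum()
elapsed_ms = (df['time'] - df.groupby(run)['time'].transform('first')
              ).dt.total_seconds() * 1000

# Fixed, predefined half-open bins [0-10ms), [10-20ms), ...
bins = range(0, 1001, 10)
df['bin'] = pd.cut(elapsed_ms, bins=bins, right=False)
```

With `right=False`, a row exactly on a boundary (e.g. 10 ms) falls into the next bin, matching the [0-10ms), [10-20ms) convention in the question.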

Below is the code I have used so far:

import pyodbc 
import pandas as pd
import numpy as np

conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=XXXXXXXXX;'
                      'Database=Dynamics;'
                      'Trusted_Connection=yes;')

cursor = conn.cursor()

SQL_Query = pd.read_sql_query('''select ID,time,data1,data2,data3,data4,data5 from Dynamics''', conn)
df = pd.DataFrame(SQL_Query, columns=['ID','time','data2','data3','data4','data5'])
df.time=pd.to_timedelta(df.time, unit='ms')
df[['data4']] = df[['data4']].apply(pd.to_numeric)
df['diff']=df['data4']-df['data5']
df['diff']=df['diff'].abs()
df=df.groupby(['data3','time'])
print(df)



                     time data_1  data_2 data_3  data_4  data_5
0 2019-11-22 01:18:00.010      a      55      A    1.20    1.24
1 2019-11-22 01:18:00.090      a      55      B    1.25    1.24
2 2019-11-22 01:18:00.100      a      55      C    1.26    1.24
3 2019-11-22 01:18:00.140      a      55      A    1.22    1.22
4 2019-11-22 01:18:00.160      a      55      B    1.23    1.22

Solution

  • Pandas has a great feature of date ranges. Here is an example that creates a one-minute range, with a new millisecond on each row (which is also the index).

    import pandas as pd
    from datetime import timedelta
    import numpy as np
    
    date_rng = pd.date_range(start='2019-11-22T01:18:00.00100', end='2019-11-22T01:19:00.00000', freq='ms') #one minute, in milliseconds
    n = len(date_rng) # n = 60000
    values = np.random.random(n) # make n random numbers
    
    df = pd.DataFrame({'values': values}, index=date_rng)
    print ('dataframe: ')
    print (df.head())
    

    This is the head of df:

    dataframe: 
                               values
    2019-11-22 01:18:00.001  0.914796
    2019-11-22 01:18:00.002  0.760555
    2019-11-22 01:18:00.003  0.132992
    2019-11-22 01:18:00.004  0.572391
    2019-11-22 01:18:00.005  0.090188
    

    Next, Pandas has a nice resample feature which, in this example, sums the values in 10 ms bins.

    df2 = df.resample(rule=timedelta(milliseconds=10)).sum() # df2 sums the values in 10 ms bins
    print ('beginning of df2')
    print (df2.head())
    print ('...')
    print (df2.tail())
    

    Here is the output:

    beginning of df2
                               values
    2019-11-22 01:18:00.000  5.236037
    2019-11-22 01:18:00.010  4.446964
    2019-11-22 01:18:00.020  6.549635
    2019-11-22 01:18:00.030  5.141522
    2019-11-22 01:18:00.040  5.375919
    ...
                               values
    2019-11-22 01:18:59.960  3.876523
    2019-11-22 01:18:59.970  4.864252
    2019-11-22 01:18:59.980  5.690987
    2019-11-22 01:18:59.990  2.787247
    2019-11-22 01:19:00.000  0.613545
    

    Note that the last value is much smaller, as only 1 ms is represented.
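    To tie this back to the question's "restart the clock on each data2 change": the same binning can be done per group by pairing the group key with a time-based `pd.Grouper` on the index. A sketch with invented data (the column names `data2`/`values` are assumed):

    ```python
    import pandas as pd
    import numpy as np

    # Invented frame: 40 rows, 1 ms apart, with data2 switching halfway through
    idx = pd.date_range('2019-11-22 01:18:00', periods=40, freq='ms')
    df = pd.DataFrame({'data2': [55] * 20 + [56] * 20,
                       'values': np.arange(40.0)}, index=idx)

    # Sum 'values' in 10 ms bins, separately for each data2 value
    binned = df.groupby(['data2', pd.Grouper(freq='10ms')])['values'].sum()
    ```

    The result is indexed by (data2, bin start), so each data2 value gets its own sequence of 10 ms bins.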