Search code examples
pythonpandasdataframedatetimediscretization

How to discretize a datetime column?


I have a dataset that contains a column of datetime of a month, and I need to divide it into two blocks (day and night or am\pm) and then discretize the time in each block into 10mins bins. I could add another column of 0 and 1 to show it is am or pm, but I cannot discretize it! Can you please help me with it?

df['started_at'] = pd.to_datetime(df['started_at'])
df['start hour'] = df['started_at'].dt.hour.astype('int')
df['mor/aft'] = np.where(df['start hour'] < 12, 1, 0)
df['started_at']

0          16:05:36
2          06:22:40
3          16:08:10
4          12:28:57
6          15:47:30
             ...   
3084526    15:24:24
3084527    16:33:07
3084532    14:08:12
3084535    09:43:46
3084536    17:02:26

Solution

  • If I understood correctly you are trying to add a column for every interval of ten minutes to indicate if an observation is from that interval of time.

    You can use lambda expressions to loop through each observation from the series.

    Dividing by 10 and making this an integer gives the first digit of the minutes, based on which you can add indicator columns.

    I also included how to extract the day indicator column with a lambda expression for you to compare. It achieves the same as your np.where().

    import pandas as pd
    from datetime import datetime
    
    # make dataframe
    df = pd.DataFrame({
        'started_at': ['14:20:56', 
                       '00:13:24', 
                       '16:01:33']
    })
    
    # convert column to datetime
    df['started_at'] = pd.to_datetime(df['started_at'])
    
    # make day indicator column
    df['day'] = df['started_at'].apply(lambda ts: 1 if ts.hour > 12 else 0)
    
    # make indicator column for every ten minutes
    for i in range(24):
        for j in range(6):
            col = 'hour_' + str(i) + '_min_' + str(j) + '0'
            df[col] = df['started_at'].apply(lambda ts: 1 if int(ts.minute/10) == j and ts.hour == i else 0)
    
    print(df)
    

    Output first columns:

               started_at  day  hour_0_min_00  hour_0_min_10  hour_0_min_20  
    0 2021-11-21 14:20:56    1              0              0              0   
    1 2021-11-21 00:13:24    0              0              1              0   
    2 2021-11-21 16:01:33    1              0              0              0   
    ...
    ...
    ...