Search code examples
pythondataframemappingcategorical-data

Create a categorical column from a datetime column in Python


I have a dataframe with a column of datetime type that I'd like to create a 2nd categorical column based upon. The second column should consist of string identifiers using the following logic:

ON: [16:00-09:30)
FH: [09:30-10:00)
M: [10:00-15:00)
SLH: [15:00-15:30)
LH: [15:30-16:00) 

For instance, for dataframe d:

date_rng = pd.date_range(start='2024-01-01', periods=48, freq='30min')  
d = pd.DataFrame({'datetime': date_rng})
    

I'd like the output below:

d.head(40)
Out[186]: 
              datetime part
0  2024-01-01 00:00:00   ON
1  2024-01-01 00:30:00   ON
2  2024-01-01 01:00:00   ON
3  2024-01-01 01:30:00   ON
4  2024-01-01 02:00:00   ON
5  2024-01-01 02:30:00   ON
6  2024-01-01 03:00:00   ON
7  2024-01-01 03:30:00   ON
8  2024-01-01 04:00:00   ON
9  2024-01-01 04:30:00   ON
10 2024-01-01 05:00:00   ON
11 2024-01-01 05:30:00   ON
12 2024-01-01 06:00:00   ON
13 2024-01-01 06:30:00   ON
14 2024-01-01 07:00:00   ON
15 2024-01-01 07:30:00   ON
16 2024-01-01 08:00:00   ON
17 2024-01-01 08:30:00   ON
18 2024-01-01 09:00:00   FH
19 2024-01-01 09:30:00   FH
20 2024-01-01 10:00:00    M
21 2024-01-01 10:30:00    M
22 2024-01-01 11:00:00    M
23 2024-01-01 11:30:00    M
24 2024-01-01 12:00:00    M
25 2024-01-01 12:30:00    M
26 2024-01-01 13:00:00    M
27 2024-01-01 13:30:00    M
28 2024-01-01 14:00:00    M
29 2024-01-01 14:30:00    M
30 2024-01-01 15:00:00  SLH
31 2024-01-01 15:30:00   LH
32 2024-01-01 16:00:00   ON
33 2024-01-01 16:30:00   ON
34 2024-01-01 17:00:00   ON
35 2024-01-01 17:30:00   ON
36 2024-01-01 18:00:00   ON
37 2024-01-01 18:30:00   ON
38 2024-01-01 19:00:00   ON
39 2024-01-01 19:30:00   ON

Solution

  • Option 1: pd.cut

    base_time = d['datetime'].dt.normalize() + pd.Timedelta(hours=16)
    
    offset = (d['datetime'] - base_time).dt.total_seconds().mod(86400) # 24 * 3600
    
    bins = [hours * 3600 for hours in [0, 17.5, 18, 23, 23.5, 24]]
    
    labels = ['ON', 'FH', 'M', 'SLH', 'LH']
    
    d['part'] = pd.cut(offset, bins=bins, labels=labels, right=False)
    

    Output:

    # showing first row per continuous value
    d[d['part'].ne(d['part'].shift())]
    
                  datetime part
    0  2024-01-01 00:00:00   ON
    19 2024-01-01 09:30:00   FH
    20 2024-01-01 10:00:00    M
    30 2024-01-01 15:00:00  SLH
    31 2024-01-01 15:30:00   LH
    32 2024-01-01 16:00:00   ON
    

    with:

    d['part'].dtype
    
    CategoricalDtype(categories=['ON', 'FH', 'M', 'SLH', 'LH'], ordered=True, 
                     categories_dtype=object)
    

    Explanation

    • Use Series.dt.normalize + offset 16 hours with pd.Timedelta to set a base_time at 16:00.
    • Subtract from d['datetime'] and get Series.dt.total_seconds + Series.mod with 24 * 3600 (day in seconds).
    • Create bins with endpoints set against 16:00 as 0. E.g., 17.5 = 8 hours until end day + 9.5 hours next day (09:30).
    • Set labels and apply pd.cut with right=False.

    Option 2: np.select

    import numpy as np
    
    delta = (d['datetime'] - d['datetime'].dt.normalize()).dt.total_seconds()
    
    condlist = [
        (delta < hours * 3600) for hours in [9.5, 10, 15, 15.5, 16]
        ]
    
    d['part2'] = np.select(condlist=condlist, choicelist=labels, 
                           default='ON')
    
    d['part2'] = pd.Categorical(d['part2'], categories=labels, ordered=True)
    

    Equality check:

    d['part'].equals(d['part2'])
    # True
    

    Explanation

    • Get timedelta in seconds with Series.dt.normalize + df.total_seconds.
    • Set up condlist ("bins") with delta smaller than 09:30 (9.5), 10, etc.
    • Pass condlist to np.select and use default='ON' to get >= 16:00 as 'ON' as well.
    • Finally, apply pd.Categorical with categories=labels and ordered=True to get: ['ON' < 'FH' < 'M' < 'SLH' < 'LH'].