python pandas machine-learning one-hot-encoding

Event start-end into one-hot encoding in Python


I have a pandas DataFrame with two columns, "type" and "sign", as follows:

    type    sign
0   open    A
1   open    B
2   open    D
3   close   B
4   close   D
5   open    B
6   close   B
7   close   A

"A" + "open" means that event A has started. "A" + "close" means that event A has stopped. I need to vectorize this, and speed is a major concern (the real data is about 40 million rows long). It is similar to one-hot encoding, but I need a "1" to appear if and only if the event is active at that row. For example, for the data above the result should be:

    A   B   C   D   type    sign
0   1   0   0   0   open    A
1   1   1   0   0   open    B
2   1   1   0   1   open    D
3   1   0   0   1   close   B
4   1   0   0   0   close   D
5   1   1   0   0   open    B
6   1   0   0   0   close   B
7   0   0   0   0   close   A

Any ideas? Thanks.


Solution

  • For your data, you can simply loop over the signs and build one column per sign:

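    for sign in 'ABCD':
        # Keep 'type' only on rows for this sign, carry the last value forward,
        # then flag the rows where the most recent value is 'open'.
        df[sign] = (df['type'].where(df['sign'].eq(sign))
                     .ffill()
                     .eq('open')
                     .astype(int)
                   )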
    

    Output:

        type sign  A  B  C  D
    0   open    A  1  0  0  0
    1   open    B  1  1  0  0
    2   open    D  1  1  0  1
    3  close    B  1  0  0  1
    4  close    D  1  0  0  0
    5   open    B  1  1  0  0
    6  close    B  1  0  0  0
    7  close    A  0  0  0  0
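
As an alternative to the loop above, here is a minimal sketch that avoids the Python-level loop. It is a swapped-in technique, not the method from the answer: treat "open" as +1 and "close" as -1, spread those values into one column per sign, and take a cumulative sum down the rows. It assumes that every sign strictly alternates between open and close (starting with open) and that the full list of signs is known up front. The names `signs`, `delta`, and `active` are illustrative only.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "type": ["open", "open", "open", "close", "close", "open", "close", "close"],
    "sign": ["A", "B", "D", "B", "D", "B", "B", "A"],
})
signs = list("ABCD")  # assumption: known in advance; "C" never occurs in the data

# +1 when an event opens, -1 when it closes
delta = df["type"].map({"open": 1, "close": -1})

# One indicator column per sign, signed by delta, then accumulated down the rows
active = (
    pd.get_dummies(df["sign"], dtype=int)
    .mul(delta, axis=0)
    .cumsum()
    .reindex(columns=signs, fill_value=0)  # add all-zero columns for unseen signs
)

result = pd.concat([active, df], axis=1)
print(result)
```

If the alternation assumption does not hold (for example, a sign is opened twice in a row), the cumulative sum can exceed 1 or go negative, whereas the forward-fill loop still yields 0/1 values. Which approach is faster on 40 million rows is worth measuring rather than assuming.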