Search code examples
pythonpandasfeature-engineering

Create A count Features right


I have a table like that:

PC USER Period
A U1 2020/01
A U2 2020/02
A U3 2020/03
A U4 2020/04
B U1 2020/01
B U2 2020/02
B U2 2020/03

I need to create a "feature count" to check how many time "PC" found and "user" connected

now for a machine learning algorithm what is the best way to do that? or even a combination of them

  1. for every row show the max connection:
PC USER Period Max Con Per PC Max Con Per USER
A U1 2020/01 4 2
A U2 2020/02 4 3
A U3 2020/03 4 1
A U4 2020/04 4 1
B U1 2020/01 3 2
B U2 2020/02 3 3
B U2 2020/03 3 3
  1. for every row show the COUNT OF the connection:
PC USER Period Max Con Per PC Max Con Per USER
A U1 2020/01 1 1
A U2 2020/02 2 1
A U3 2020/03 3 1
A U4 2020/04 4 1
B U1 2020/01 1 2
B U2 2020/02 2 2
B U2 2020/03 3 3

Solution

  • You can use groupby then transform and cumcount:

    # Use lazy groups
    gpc = df.groupby('PC')['PC']
    gus = df.groupby('USER')['Period']
    
    df['Max Con Per PC'] = gpc.transform('size')
    df['Count Con Per PC'] = gpc.cumcount() + 1
    df['Max Con Per USER'] = gus.transform('size')
    df['Count Con Per USER'] = gus.cumcount() + 1
    

    Output:

    >>> df
      PC USER   Period  Max Con Per PC  Count Con Per PC  Max Con Per USER  Count Con Per USER
    0  A   U1  2020/01               4                 1                 2                   1
    1  A   U2  2020/02               4                 2                 3                   1
    2  A   U3  2020/03               4                 3                 1                   1
    3  A   U4  2020/04               4                 4                 1                   1
    4  B   U1  2020/01               3                 1                 2                   2
    5  B   U2  2020/02               3                 2                 3                   2
    6  B   U2  2020/03               3                 3                 3                   3