Search code examples
pythonpandasdata-analysis

Create a feature table in Python from a df


I have the following df:

 id    step1 step2 step3 step4 .... stepn-1, stepn, event 
  1       a     b    c     null         null  null    1
  2       b     d    f     null         null  null    0
  3       a     d    g      h             l    m      1      

Where the id is a session, the steps represent a certain path, and event is whether something specific happened

I want to create a feature store where we take all the possible steps (a, b, c, ... all the way to some arbitrary number) and make them the columns. Then I want the x-column to remain the id and it just fill a 1 or zero if that session hit that step in the column. The result is below:

id  a  b  c  d  e  f  g ... n event
 1  1  1  1  0  0  0  0     0   1
 2  0  1  0  0  0  1  0     0   0
 3  1  0  0  1  0  0  1     1   1

I have a unique list of all the possible steps which I assume will be used to construct the new table. But after that I am struggling thinking how to create this.


Solution

  • What you are looking for is often used in machine learning, and is called one-hot encoding.

    There is a pandas function specifically designed for this purpose, called pd.get_dummies().

    step_cols = [c for c in df.columns if c.startswith('step')]
    other_cols = [c for c in df.columns if not c.startswith('step')]
    
    new_df = pd.get_dummies(df[step_cols].stack()).groupby(level=0).max()
    new_df[other_cols] = df[other_cols]
    

    Output:

    >>> new_df
       a  b  c  d  f  g  h  l  m  id  event
    0  1  1  1  0  0  0  0  0  0   1      1
    1  0  1  0  1  1  0  0  0  0   2      0
    2  1  0  0  1  0  1  1  1  1   3      1