Tags: python, data-science, feature-extraction, feature-engineering, featuretools

Python featuretools difference by data group


I'm trying to use featuretools to calculate time-series functions. Specifically, I'd like to subtract previous(x) from current(x) within each group key (user_id), but I'm having trouble adding this kind of relationship to the entityset.

import pandas as pd
import featuretools as ft

df = pd.DataFrame({
    "user_id": [i % 2 for i in range(0, 6)],
    'x': range(0, 6),
    'time': pd.to_datetime(['2014-1-1 04:00', '2014-1-1 05:00',
                            '2014-1-1 06:00', '2014-1-1 08:00', '2014-1-1 10:00', '2014-1-1 12:00'])
})

print(df.to_string())
       user_id  x                time
0        0      0 2014-01-01 04:00:00
1        1      1 2014-01-01 05:00:00
2        0      2 2014-01-01 06:00:00
3        1      3 2014-01-01 08:00:00
4        0      4 2014-01-01 10:00:00
5        1      5 2014-01-01 12:00:00


es = ft.EntitySet(id='test')
es.entity_from_dataframe(entity_id='data', dataframe=df,
                         variable_types={
                             'user_id': ft.variable_types.Categorical,
                             'x': ft.variable_types.Numeric,
                             'time': ft.variable_types.Datetime
                         },
                         make_index=True, index='index',
                         time_index='time'
                         )

I then try to invoke dfs, but I can't get the relationship right...

fm, fl = ft.dfs(
    target_entity="data",
    entityset=es,
    trans_primitives=["diff"]
)
print(fm.to_string())
       user_id  x  DIFF(x)
index                     
0            0  0      NaN
1            1  1      1.0
2            0  2      1.0
3            1  3      1.0
4            0  4      1.0
5            1  5      1.0

But what I'd actually want to get is the difference by user. That is, from the last value for each user:

       user_id  x  DIFF(x)
index                     
0            0  0      NaN
1            1  1      NaN
2            0  2      2.0
3            1  3      2.0
4            0  4      2.0
5            1  5      2.0

How do I get this kind of relationship in featuretools? I've tried several tutorials, but to no avail. I'm stumped.

Thanks!


Solution

  • Thanks for the question. You can get the expected output by normalizing an entity for users and applying a group by transform primitive. I'll go through a quick example using this data.

    user_id  x                time
          0  0 2014-01-01 04:00:00
          1  1 2014-01-01 05:00:00
          0  2 2014-01-01 06:00:00
          1  3 2014-01-01 08:00:00
          0  4 2014-01-01 10:00:00
          1  5 2014-01-01 12:00:00
    

    First, create the entity set and normalize an entity for the users.

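    import featuretools as ft

    # This uses the pre-1.0 featuretools API (entity_from_dataframe, normalize_entity).
    es = ft.EntitySet(id='test')
    
    # Register the raw rows; make_index=True creates a new unique 'index' column.
    es.entity_from_dataframe(
        dataframe=df,
        entity_id='data',
        make_index=True,
        index='index',
        time_index='time',
    )
    
    # Split out a 'users' entity keyed on user_id, so that user_id becomes
    # an id column in 'data' that DFS can group by.
    es.normalize_entity(
        base_entity_id='data',
        new_entity_id='users',
        index='user_id',
    )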
    

    Then, apply the group by transform primitive in DFS.

    fm, fl = ft.dfs(
        target_entity="data",
        entityset=es,
        groupby_trans_primitives=["diff"],
    )
    
    fm.filter(regex="DIFF", axis=1)
    

    You should get the difference by user.

           DIFF(x) by user_id
    index
    0                     NaN
    1                     NaN
    2                     2.0
    3                     2.0
    4                     2.0
    5                     2.0
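
As a sanity check outside featuretools, the same per-user difference can be reproduced with plain pandas. This is only a sketch for verifying the numbers above, not a description of what DFS does internally; the name `diff_by_user` is introduced here for illustration.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "user_id": [i % 2 for i in range(6)],
    "x": range(6),
    "time": pd.to_datetime([
        "2014-01-01 04:00", "2014-01-01 05:00", "2014-01-01 06:00",
        "2014-01-01 08:00", "2014-01-01 10:00", "2014-01-01 12:00",
    ]),
})

# Sort by time so that "previous" means the user's previous observation,
# then take the difference of x within each user_id group.
diff_by_user = df.sort_values("time").groupby("user_id")["x"].diff()
print(diff_by_user.tolist())  # [nan, nan, 2.0, 2.0, 2.0, 2.0]
```

The first row for each user has no previous value, so it comes out as NaN, which matches the DIFF(x) by user_id column.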