Search code examples
pythondata-sciencefeature-extractionfeature-engineeringfeaturetools

How do I create features in featuretools for rows with the same id and a time index?


I have a Dataframe like this

data = {'Customer':['C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C3', 'C3', 'C3'],
        'NumOfItems':[3, 2, 4, 5, 5, 6, 10, 6, 14],
        'PurchaseTime':["2014-01-01", "2014-01-02", "2014-01-03","2014-01-01", "2014-01-02", "2014-01-03","2014-01-01", "2014-01-02", "2014-01-03"]
       }
df = pd.DataFrame(data)
df

I want to create a Feature which is for example the max value for each customer up to this point:

'MaxPerID(NumOfItems)':[3, 3, 4, 5, 5, 6, 10, 10, 14] #the output i want

So I set up the EntitySet and normalize it …

es = ft.EntitySet(id="customer_data")
es = es.entity_from_dataframe(entity_id="customer",
                              dataframe=df,
                              index='index',
                              time_index="PurchaseTime",
                             make_index=True)

es = es.normalize_entity(base_entity_id="customer",
                         new_entity_id="sessions",
                         index="Customer")

But creating the feature matrix doesn't produce the results i want.

feature_matrix, features = ft.dfs(entityset=es,
                                 target_entity="customer",
                                 agg_primitives = ["max"],
                                 max_depth = 3                                      
                                 )
feature_matrix.head

sessions.MAX(customer.NumOfItems)  
index                                                                         
0                                      4                                    
3                                      6                                    
6                                     14                                    
1                                      4                                    
4                                      6                                    
7                                     14                                    
2                                      4                                    
5                                      6                                    
8                                     14                                    

The returned feature is the max value per day from all customers (sorted by time), however if I run the same code without the time_index = "PurchaseTime" the result is the max value just for the specific customer

    sessions.MAX(customer.NumOfItems)  \
index                                                                       
0                    4   
1                    4   
2                    4   
3                    6   
4                    6   
5                    6   
6                   14   
7                   14   
8                   14   
                             

I want a combination of these two: the max value for the specific customer up to this point. Is this possible? I tried to work with es['customer']['Customer'].interesting_values =['C1', 'C2', 'C3'] but it didn't get me anywhere. I also tried modifying the new normalized entity and writing my own primitive for this.

I'm new to featuretools so any help would be greatly appreciated.

This Question is similar to mine but the solution has no time_index and is creating the new features on the normalized entity


Solution

  • Thanks for the question. You can get the expected output by using a group by transform primitive.

    fm, fd = ft.dfs(
        entityset=es,
        target_entity="customer",
        groupby_trans_primitives=['cum_max'],
    )
    

    You should get the cumulative max of the number of items per customer.

    column = 'CUM_MAX(NumOfItems) by Customer'
    actual = fm[[column]].sort_values(column)
    expected = {'MaxPerID(NumOfItems)': [3, 3, 4, 5, 5, 6, 10, 10, 14]}
    actual.assign(**expected)
    
           CUM_MAX(NumOfItems) by Customer  MaxPerID(NumOfItems)
    index
    0                                  3.0                     3
    1                                  3.0                     3
    2                                  4.0                     4
    3                                  5.0                     5
    4                                  5.0                     5
    5                                  6.0                     6
    6                                 10.0                    10
    7                                 10.0                    10
    8                                 14.0                    14