I have a Dataframe like this
data = {'Customer':['C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C3', 'C3', 'C3'],
'NumOfItems':[3, 2, 4, 5, 5, 6, 10, 6, 14],
'PurchaseTime':["2014-01-01", "2014-01-02", "2014-01-03","2014-01-01", "2014-01-02", "2014-01-03","2014-01-01", "2014-01-02", "2014-01-03"]
}
df = pd.DataFrame(data)
df
I want to create a Feature which is for example the max value for each customer up to this point:
'MaxPerID(NumOfItems)':[3, 3, 4, 5, 5, 6, 10, 10, 14] #the output i want
So I set up the EntitySet and normalize it …
es = ft.EntitySet(id="customer_data")
es = es.entity_from_dataframe(entity_id="customer",
dataframe=df,
index='index',
time_index="PurchaseTime",
make_index=True)
es = es.normalize_entity(base_entity_id="customer",
new_entity_id="sessions",
index="Customer")
But creating the feature matrix doesn't produce the results i want.
feature_matrix, features = ft.dfs(entityset=es,
target_entity="customer",
agg_primitives = ["max"],
max_depth = 3
)
feature_matrix.head
sessions.MAX(customer.NumOfItems)
index
0 4
3 6
6 14
1 4
4 6
7 14
2 4
5 6
8 14
The returned feature is the max value per day from all customers (sorted by time), however if I run the same code without the time_index = "PurchaseTime"
the result is the max value just for the specific customer
sessions.MAX(customer.NumOfItems) \
index
0 4
1 4
2 4
3 6
4 6
5 6
6 14
7 14
8 14
I want a combination of these two: the max value for the specific customer up to this point.
Is this possible? I tried to work with es['customer']['Customer'].interesting_values =['C1', 'C2', 'C3']
but it didn't get me anywhere. I also tried modifying the new normalized entity and writing my own primitive for this.
I'm new to featuretools so any help would be greatly appreciated.
Thanks for the question. You can get the expected output by using a group by transform primitive.
fm, fd = ft.dfs(
entityset=es,
target_entity="customer",
groupby_trans_primitives=['cum_max'],
)
You should get the cumulative max of the number of items per customer.
column = 'CUM_MAX(NumOfItems) by Customer'
actual = fm[[column]].sort_values(column)
expected = {'MaxPerID(NumOfItems)': [3, 3, 4, 5, 5, 6, 10, 10, 14]}
actual.assign(**expected)
CUM_MAX(NumOfItems) by Customer MaxPerID(NumOfItems)
index
0 3.0 3
1 3.0 3
2 4.0 4
3 5.0 5
4 5.0 5
5 6.0 6
6 10.0 10
7 10.0 10
8 14.0 14