I have more than 100,000 rows of training data with timestamps and would like to calculate a feature matrix for new test data, of which there are only 10 rows. Some of the features in the test data will end up aggregating some of the training data. I need the implementation to be fast since this is one step in a real-time inference pipeline.
I can think of two ways this can be implemented:

1. Concatenating the train and test entity sets, running DFS, and then keeping only the last 10 rows and throwing the rest away (see the sketch after this list). This is very time consuming. Is there a way to calculate a feature matrix for a subset of an entity set while still using data from the entire entity set?
2. Using the steps outlined in the "Calculating Feature Matrix for New Data" section on the Featuretools Deployment page. However, as demonstrated below, this doesn't seem to work.
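For reference, approach 1 looks roughly like this (a minimal sketch; `df_train` and `df_test` are hypothetical frames with the same schema as the sessions data used below):

```python
import pandas as pd
import featuretools as ft

# Approach 1: concatenate everything, run DFS over the full history,
# then throw away all but the 10 test rows.
es = ft.EntitySet(id='sessions')
es = es.entity_from_dataframe(entity_id='sessions',
                              dataframe=pd.concat([df_train, df_test]),
                              index='session_id',
                              time_index='session_start')
es = es.normalize_entity(base_entity_id='sessions',
                         new_entity_id='customers',
                         index='customer_id')

feature_matrix, features_defs = ft.dfs(entityset=es,
                                       target_entity='sessions')
feature_matrix_test = feature_matrix.iloc[-10:]  # keep only the test rows
```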
Create all/train/test entity sets:
import featuretools as ft
data = ft.demo.load_mock_customer(n_customers=3, n_sessions=15)
df_sessions = data['sessions']
# Create all/train/test entity sets.
all_es = ft.EntitySet(id='sessions')
train_es = ft.EntitySet(id='sessions')
test_es = ft.EntitySet(id='sessions')
all_es = all_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions,  # all sessions
    index='session_id',
    time_index='session_start',
)
train_es = train_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions.iloc[:10],  # first 10 sessions
    index='session_id',
    time_index='session_start',
)
test_es = test_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions.iloc[10:],  # last 5 sessions
    index='session_id',
    time_index='session_start',
)
# Normalise customer entities so we can group by customers.
all_es = all_es.normalize_entity(base_entity_id='sessions',
                                 new_entity_id='customers',
                                 index='customer_id')
train_es = train_es.normalize_entity(base_entity_id='sessions',
                                     new_entity_id='customers',
                                     index='customer_id')
test_es = test_es.normalize_entity(base_entity_id='sessions',
                                   new_entity_id='customers',
                                   index='customer_id')
Set `cutoff_time`, since we are dealing with timestamped data:
cutoff_time = (df_sessions
               .filter(['session_id', 'session_start'])
               .rename(columns={'session_id': 'instance_id',
                                'session_start': 'time'}))
Calculate feature matrix for all data:
feature_matrix, features_defs = ft.dfs(entityset=all_es,
                                       cutoff_time=cutoff_time,
                                       target_entity='sessions')
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
session_id | customer_id | customers.COUNT(sessions) |
---|---|---|
1 | 3 | 1 |
2 | 3 | 2 |
3 | 1 | 1 |
4 | 2 | 1 |
5 | 2 | 2 |
6 | 2 | 3 |
7 | 2 | 4 |
8 | 1 | 2 |
9 | 2 | 5 |
10 | 1 | 3 |
11 | 1 | 4 |
12 | 2 | 6 |
13 | 3 | 3 |
14 | 1 | 5 |
15 | 3 | 4 |
Calculate feature matrix for train data:
feature_matrix, features_defs = ft.dfs(entityset=train_es,
                                       cutoff_time=cutoff_time.iloc[:10],
                                       target_entity='sessions')
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
session_id | customer_id | customers.COUNT(sessions) |
---|---|---|
1 | 3 | 1 |
2 | 3 | 2 |
3 | 1 | 1 |
4 | 2 | 1 |
5 | 2 | 2 |
6 | 2 | 3 |
7 | 2 | 4 |
8 | 1 | 2 |
9 | 2 | 5 |
10 | 1 | 3 |
Calculate feature matrix for test data (using the method shown in the "Calculating Feature Matrix for New Data" section on the Featuretools Deployment page):
feature_matrix = ft.calculate_feature_matrix(features=features_defs,
                                             entityset=test_es,
                                             cutoff_time=cutoff_time.iloc[10:])
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
session_id | customer_id | customers.COUNT(sessions) |
---|---|---|
11 | 1 | 1 |
12 | 2 | 1 |
13 | 3 | 1 |
14 | 1 | 2 |
15 | 3 | 2 |
As you can see, the feature matrix generated from `train_es` matches the first 10 rows of the feature matrix generated from `all_es`. However, the feature matrix generated from `test_es` doesn't match the corresponding rows of the feature matrix generated from `all_es`.
You can control which instances you want to generate features for with the `cutoff_time` dataframe (or the `instance_ids` argument to DFS if the cutoff time is a single datetime). Featuretools will only generate features for instances whose IDs are in the cutoff time dataframe and will ignore all others:
feature_matrix, features_defs = ft.dfs(entityset=all_es,
                                       cutoff_time=cutoff_time.iloc[10:],
                                       target_entity='sessions')
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
session_id | customer_id | customers.COUNT(sessions) |
---|---|---|
11 | 1 | 4 |
12 | 2 | 6 |
13 | 3 | 3 |
14 | 1 | 5 |
15 | 3 | 4 |
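For completeness, the `instance_ids` route mentioned above would look something like this (a sketch; the timestamp is an assumed placeholder cutoff, not a value taken from the original data):

```python
import pandas as pd

# Single shared cutoff datetime plus explicit instance IDs.
# The timestamp below is an assumed placeholder after the last session start.
feature_matrix, features_defs = ft.dfs(entityset=all_es,
                                       target_entity='sessions',
                                       cutoff_time=pd.Timestamp('2014-01-02'),
                                       instance_ids=[11, 12, 13, 14, 15])
```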
The method in "Calculating Feature Matrix for New Data" is useful when you want to calculate the same features on entirely new data. All the same features will be created, but no data is shared between the entity sets. That doesn't work in this case, since the goal is to use all the data but generate features only for certain instances.
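Putting this together for the real-time pipeline described in the question, one workable pattern (a sketch; `df_train` and `df_new` are hypothetical frames, and it assumes the feature definitions were saved at training time, e.g. with `ft.save_features`) is to append the 10 new rows to the training data, rebuild the entity set, and restrict `cutoff_time` to just the new instance IDs:

```python
import pandas as pd
import featuretools as ft

# Feature definitions saved at training time (hypothetical path).
features_defs = ft.load_features('features.json')

# Rebuild the entity set over the full history plus the new rows.
es = ft.EntitySet(id='sessions')
es = es.entity_from_dataframe(entity_id='sessions',
                              dataframe=pd.concat([df_train, df_new]),
                              index='session_id',
                              time_index='session_start')
es = es.normalize_entity(base_entity_id='sessions',
                         new_entity_id='customers',
                         index='customer_id')

# Cutoff times for the new instances only: features are computed just for
# these rows, but the aggregations can still see the full history.
cutoff_time = (df_new
               .filter(['session_id', 'session_start'])
               .rename(columns={'session_id': 'instance_id',
                                'session_start': 'time'}))

feature_matrix_new = ft.calculate_feature_matrix(features=features_defs,
                                                 entityset=es,
                                                 cutoff_time=cutoff_time)
```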