I am building a model to predict whether a user will purchase a subscription based on their activity (reading history, etc.). I am using featuretools (https://www.featuretools.com/) to automate feature engineering, and this is where it gets tricky:
How should I decide the cutoff time / window for my training data, given the following call?
    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_entity="users",
        max_depth=2,
        agg_primitives=["sum", "std", "max", "min", "mean", "median", "count",
                        "percent_true", "num_unique", "mode", "avg_time_between"],
        trans_primitives=["day", "year", "month", "weekday",
                          "time_since_previous", "time_since", "is_weekend"],
        cutoff_time=cutoff_times,
        cutoff_time_in_index=True,
        training_window=ft.Timedelta(180, "d"),
        n_jobs=8,
        verbose=True,
    )
How you decide the cutoff times for your training data will depend on the following:
How long should the training window be: 1 month, 6 months, etc.?
I think you can treat the training window as a tunable parameter: try several window sizes and see which gives the best model results on held-out data.
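To make the effect of the window size concrete, here is a minimal plain-Python sketch (not featuretools itself) that computes one feature, an event count, under several candidate windows. The event log, the cutoff dates, and the helper `count_in_window` are all made up for illustration, and the boundary handling (start exclusive, cutoff inclusive) is an assumption of this sketch that may differ from what featuretools does.

```python
from datetime import datetime, timedelta

# Hypothetical activity log: (user_id, event_time) pairs.
events = [
    ("u1", datetime(2020, 1, 20)),
    ("u1", datetime(2020, 4, 20)),
    ("u1", datetime(2020, 6, 25)),
    ("u2", datetime(2019, 11, 1)),
    ("u2", datetime(2020, 6, 10)),
]

# One cutoff time per user (hypothetical values).
cutoffs = {"u1": datetime(2020, 7, 1), "u2": datetime(2020, 7, 1)}


def count_in_window(user_id, cutoff, window_days):
    """Count a user's events in (cutoff - window, cutoff]."""
    start = cutoff - timedelta(days=window_days)
    return sum(1 for uid, t in events if uid == user_id and start < t <= cutoff)


# Compute the same feature under several candidate training windows.
window_counts = {
    days: {uid: count_in_window(uid, cutoff, days) for uid, cutoff in cutoffs.items()}
    for days in (30, 90, 180)
}
```

In practice, each candidate size would be passed as `training_window` to `ft.dfs`, and the resulting feature matrices compared by how well the downstream model scores on data it has not seen.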
Given that user activity may differ before and after subscription, I should cut off data for current subscribers at the time they subscribed (to prevent leakage). But when should I cut off for non-subscribers?
I think you can pick them randomly, or at times that are representative of when you’re going to use the model on those users in the future.
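As a rough sketch of the random option, the plain-Python snippet below builds one cutoff row per user: subscribers are cut off at their subscription time, and non-subscribers get a uniformly sampled time within their observed activity span. The `users` dictionary, `data_end`, and `make_cutoff` are hypothetical names invented for this example, and uniform sampling is only one of several reasonable choices.

```python
import random
from datetime import datetime, timedelta

# Hypothetical users: first activity time and, for subscribers, subscription time.
users = {
    "u1": {"first_seen": datetime(2020, 1, 1), "subscribed_at": datetime(2020, 3, 15)},
    "u2": {"first_seen": datetime(2020, 2, 1), "subscribed_at": None},
    "u3": {"first_seen": datetime(2020, 1, 10), "subscribed_at": None},
}
# Last timestamp present in the data (assumed known).
data_end = datetime(2020, 6, 30)

rng = random.Random(0)  # seeded so the sampled cutoffs are reproducible


def make_cutoff(user_id, info):
    """Return one (user_id, cutoff_time, label) row for a user."""
    if info["subscribed_at"] is not None:
        # Subscriber: cut off at the subscription time. Depending on whether
        # events at the cutoff itself are used, stepping slightly earlier
        # may be safer against leakage.
        return (user_id, info["subscribed_at"], True)
    # Non-subscriber: sample a uniform random time in the observed span.
    span_seconds = (data_end - info["first_seen"]).total_seconds()
    offset = timedelta(seconds=rng.uniform(0, span_seconds))
    return (user_id, info["first_seen"] + offset, False)


cutoff_rows = [make_cutoff(uid, info) for uid, info in users.items()]
```

Rows in this shape (instance id, time, label) could then be loaded into a DataFrame and passed as `cutoff_time`. To my understanding, featuretools carries extra columns such as the label through to the feature matrix, but check the documentation for the version you are using.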
Our open-source library Compose is well suited to structuring this labeling process. If you define your prediction problem in Compose, it will automatically select the negative examples based on that definition. It also has a parameterized prediction window that lets you generate labels at specific times. Let me know if this helps.
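To illustrate what a parameterized prediction window means, here is a plain-Python sketch of the idea. It is not Compose's API, whose actual interface may look quite different. A cutoff slides forward in fixed steps, and each position is labeled by whether the user subscribes within the prediction window that follows it. All names and dates here (`subscribed_at`, `label_at`, `make_labels`) are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical subscription times; None means the user never subscribed.
subscribed_at = {"u1": datetime(2020, 3, 15), "u2": None}


def label_at(user_id, cutoff, prediction_window):
    """True if the user subscribes in (cutoff, cutoff + prediction_window]."""
    t = subscribed_at[user_id]
    return t is not None and cutoff < t <= cutoff + prediction_window


def make_labels(user_id, start, end, step, prediction_window):
    """Slide a cutoff from start to end and label each position."""
    rows = []
    cutoff = start
    while cutoff <= end:
        t = subscribed_at[user_id]
        if t is not None and cutoff >= t:
            break  # once subscribed, the user is no longer a prediction target
        rows.append((user_id, cutoff, label_at(user_id, cutoff, prediction_window)))
        cutoff += step
    return rows


labels = [
    row
    for uid in subscribed_at
    for row in make_labels(
        uid,
        start=datetime(2020, 1, 1),
        end=datetime(2020, 4, 1),
        step=timedelta(days=30),
        prediction_window=timedelta(days=30),
    )
]
```

With this framing, negative examples fall out of the problem definition: every cutoff where no subscription occurs within the window is a negative, for subscribers and non-subscribers alike.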