I have a simple entity set, parent1 <- child -> parent2, and I need to use a cutoff dataframe. My target is parent1, and it is accessible at any prediction time. I want to specify a date column only for parent2, so that this time information can be joined to the child. It doesn't work this way, and I get data leakage in the first-level features from the parent1-child entities. The only thing I can do is duplicate the date column in the child too. Is it possible to normalize the child while avoiding the date column?
Example. Imagine we have 3 entities: boxer information (parent1, with "player_name"), match information (parent2, with "country"), and their combination (child, with "n_hits" in one specific match):
import featuretools as ft
import pandas as pd

players = pd.DataFrame({"player_id": [1, 2, 3], "player_name": ["Oleg", "Kirill", "Max"]})
player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3],
    "match_id": [11, 11, 12, 12], "n_hits": [20, 30, 40, 50]})
matches = pd.DataFrame({
    "match_id": [11, 12], "match_date": pd.to_datetime(['2014-1-10', '2014-1-20']),
    "country": ["Russia", "Germany"]})

es = ft.EntitySet()
es = es.entity_from_dataframe(
    entity_id="players", dataframe=players,
    index="player_id",
    variable_types={"player_id": ft.variable_types.Categorical})
es = es.entity_from_dataframe(
    entity_id="player_stats", dataframe=player_stats,
    index="match_player_id",
    variable_types={"match_player_id": ft.variable_types.Categorical,
                    "player_id": ft.variable_types.Categorical,
                    "match_id": ft.variable_types.Categorical})
es = es.entity_from_dataframe(
    entity_id="matches", dataframe=matches,
    index="match_id",
    time_index="match_date",
    variable_types={"match_id": ft.variable_types.Categorical})

es = es.add_relationship(ft.Relationship(es["players"]["player_id"],
                                         es["player_stats"]["player_id"]))
es = es.add_relationship(ft.Relationship(es["matches"]["match_id"],
                                         es["player_stats"]["match_id"]))
Here I want to use all the information that is available on 15 January. So only the information from the first match is legal to use, not the information from the second.
cutoff_df = pd.DataFrame({
    "player_id": [1, 2, 3],
    "match_date": pd.to_datetime(['2014-1-15', '2014-1-15', '2014-1-15'])})
fm, features = ft.dfs(entityset=es, target_entity='players', cutoff_time=cutoff_df,
                      cutoff_time_in_index=True, agg_primitives=["mean"])
fm
I got:
                     player_name  MEAN(player_stats.n_hits)
player_id time
1         2014-01-15        Oleg                         30
2         2014-01-15      Kirill                         30
3         2014-01-15         Max                         50
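These numbers are what a plain group-by over all child rows would give, which suggests that no time filtering is applied to player_stats. A minimal pandas sketch of that check, done outside Featuretools (the name unfiltered_mean is only illustrative):

```python
import pandas as pd

player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104],
    "player_id": [1, 2, 1, 3],
    "match_id": [11, 11, 12, 12],
    "n_hits": [20, 30, 40, 50],
})

# No time filter at all: every row contributes, including the rows of
# match 12 (2014-01-20), which lies after the 2014-01-15 cutoff.
unfiltered_mean = player_stats.groupby("player_id")["n_hits"].mean()
print(unfiltered_mean)
```

Player 1 averages the hits from both matches, and player 3 gets a value even though the only match for that player happens after the cutoff.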
The only way I know to give player_stats a proper match_date is to join this information from matches:
player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3],
    "match_id": [11, 11, 12, 12], "n_hits": [20, 30, 40, 50],
    "match_date": pd.to_datetime(
        ['2014-1-10', '2014-1-10', '2014-1-20', '2014-1-20'])  ## a result of join
})
...
es = es.entity_from_dataframe(
    entity_id="player_stats", dataframe=player_stats,
    index="match_player_id",
    time_index="match_date",  ## a change here too
    variable_types={"match_player_id": ft.variable_types.Categorical,
                    "player_id": ft.variable_types.Categorical,
                    "match_id": ft.variable_types.Categorical})
And I get the expected result:
                     player_name  MEAN(player_stats.n_hits)
player_id time
1         2014-01-15        Oleg                       20.0
2         2014-01-15      Kirill                       30.0
3         2014-01-15         Max                        NaN
Featuretools is very conservative when it comes to the time index of an entity. We try not to infer a time index if it isn't provided. Therefore, you have to create the duplicate column as you suggest.
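One way to build that duplicate column without typing the dates by hand is a left merge from matches onto player_stats before the entity is created. The following is a minimal pandas sketch under that assumption. The by-hand cutoff check at the end (using `<=`, with the illustrative names cutoff, visible and expected) only mimics what the cutoff time should do; it is not Featuretools code.

```python
import pandas as pd

player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104],
    "player_id": [1, 2, 1, 3],
    "match_id": [11, 11, 12, 12],
    "n_hits": [20, 30, 40, 50],
})
matches = pd.DataFrame({
    "match_id": [11, 12],
    "match_date": pd.to_datetime(["2014-1-10", "2014-1-20"]),
    "country": ["Russia", "Germany"],
})

# Copy match_date from the parent onto every child row.
# validate="many_to_one" raises if match_id is not unique in matches.
player_stats = player_stats.merge(
    matches[["match_id", "match_date"]],
    on="match_id",
    how="left",
    validate="many_to_one",
)

# Sanity check outside Featuretools: apply the cutoff by hand
# (assumption: rows dated at or before the cutoff are allowed).
cutoff = pd.Timestamp("2014-01-15")
visible = player_stats[player_stats["match_date"] <= cutoff]
expected = visible.groupby("player_id")["n_hits"].mean().reindex([1, 2, 3])
print(expected)
```

The merged player_stats can then be passed to entity_from_dataframe with time_index="match_date", as in the question's second version.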