Search code examples
featuretools

Avoid duplication of date column for child entity


I have a simple entity set parent1 <- child -> parent2 and a need to use a cutoff dataframe. My target is the parent1 and it's accessible at any time of predictions. I want to specify a date column only for the parent2 so that this time information could be joined to the child. It doesn't work this way and I get data leakage on the first level features from the parent1-child entities. The only thing I can do is to duplicate the date column to the child too. Is it possible to normalize the child avoiding the date column?

Example. Imagine we have 3 entities. Box player information (parent1 with "name"), match information (parent2 with "country"), and their combination (child with "n_hits" in one specific match):

import featuretools as ft
import pandas as pd

players = pd.DataFrame({"player_id": [1, 2, 3], "player_name": ["Oleg", "Kirill", "Max"]})
player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3], 
    "match_id":        [11, 11, 12, 12],     "n_hits":    [20, 30, 40, 50]})
matches = pd.DataFrame({
    "match_id": [11, 12], "match_date": pd.to_datetime(['2014-1-10', '2014-1-20']),
    "country": ["Russia", "Germany"]})

es = ft.EntitySet()
es.entity_from_dataframe(
    entity_id="players", dataframe=players,
    index="player_id",
    variable_types={"player_id": ft.variable_types.Categorical})
es = es.entity_from_dataframe(
    entity_id="player_stats", dataframe=player_stats,
    index="match_player_id",
    variable_types={"match_player_id": ft.variable_types.Categorical,
                    "player_id": ft.variable_types.Categorical,
                    "match_id": ft.variable_types.Categorical})
es = es.entity_from_dataframe(
    entity_id="matches", dataframe=matches,
    index="match_id",
    time_index="match_date",
    variable_types={"match_id": ft.variable_types.Categorical})

es = es.add_relationship(ft.Relationship(es["players"]["player_id"], 
                                         es["player_stats"]["player_id"]))
es = es.add_relationship(ft.Relationship(es["matches"]["match_id"], 
                                         es["player_stats"]["match_id"]))

Here I want to use all available information that I have at the 15th January. So the only legal is the information for the first match, not for the second.

cutoff_df = pd.DataFrame({
  "player_id":[1, 2, 3], 
  "match_date": pd.to_datetime(['2014-1-15', '2014-1-15', '2014-1-15'])})

fm, features = ft.dfs(entityset=es, target_entity='players', cutoff_time=cutoff_df, 
                      cutoff_time_in_index=True, agg_primitives = ["mean"])
fm

I got

                     player_name  MEAN(player_stats.n_hits)
player_id time                                             
1         2014-01-15        Oleg                         30
2         2014-01-15      Kirill                         30
3         2014-01-15         Max                         50

The only way I know to set up a proper match_date to player_stats is to join this information from matches

player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3], 
    "match_id":        [11, 11, 12, 12],     "n_hits":    [20, 30, 40, 50],
    "match_date": pd.to_datetime(
       ['2014-1-10', '2014-1-10', '2014-1-20', '2014-1-20']) ## a result of join
})
...
es = es.entity_from_dataframe(
    entity_id="player_stats", dataframe=player_stats,
    index="match_player_id",
    time_index="match_date",  ## a change here too
    variable_types={"match_player_id": ft.variable_types.Categorical,
                    "player_id": ft.variable_types.Categorical,
                    "match_id": ft.variable_types.Categorical})

And I get the expected result

                     player_name  MEAN(player_stats.n_hits)
player_id time                                             
1         2014-01-15        Oleg                       20.0
2         2014-01-15      Kirill                       30.0
3         2014-01-15         Max                        NaN

Solution

  • Featuretools is very conservative when it comes to the time index of an entity. We try not to infer a time index if it isn't provided. Therefore, you have to create the duplicate column as you suggest.