Tags: feature-extraction, feature-engineering, featuretools

Using Featuretools to aggregate per time of day


I'm wondering if there's any way to calculate all the same variables I'm already getting from deep feature synthesis (i.e. counts, sums, means, etc.), but for different time segments within a day?

For example, a count of morning events (hours 0-12) as a separate variable from a count of evening events (hours 13-24).

Also, in the same vein, what would be the easiest way to get counts by day of week, day of month, day of year, etc.? Custom aggregation primitives?


Solution

  • Yes, this is possible. First, let's generate some random data, and then I'll walk through how to do it.

    import featuretools as ft
    import pandas as pd
    import numpy as np
    
    # make some random data
    n = 100
    events_df = pd.DataFrame({
        "id" : range(n),
        "customer_id": np.random.choice(["a", "b", "c"], n),
        "timestamp": pd.date_range("Jan 1, 2019", freq="1h", periods=n),
        "amount": np.random.rand(n) * 100 
    })
    
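    Because the data is generated randomly, the exact amounts in the output below will differ from run to run; seeding NumPy before building events_df makes the walkthrough reproducible, e.g.

    np.random.seed(0)  # any fixed seed makes the random draws repeatable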
    

    The first thing we want to do is add a new column for the segment we want to calculate features for

    # bucket each hour of the day (0-23) into a named segment
    def to_part_of_day(x):
        if x < 12:
            return "morning"
        elif x < 18:
            return "afternoon"
        else:
            return "evening"
    
    events_df["time_of_day"] = events_df["timestamp"].dt.hour.apply(to_part_of_day)
    
    events_df
    

    Now we have a dataframe like this

       id customer_id           timestamp     amount time_of_day
    0   0           a 2019-01-01 00:00:00  44.713802     morning
    1   1           c 2019-01-01 01:00:00  58.776476     morning
    2   2           a 2019-01-01 02:00:00  94.671566     morning
    3   3           a 2019-01-01 03:00:00  39.271852     morning
    4   4           a 2019-01-01 04:00:00  40.773290     morning
    5   5           c 2019-01-01 05:00:00  19.815855     morning
    6   6           a 2019-01-01 06:00:00  62.457129     morning
    7   7           b 2019-01-01 07:00:00  95.114636     morning
    8   8           b 2019-01-01 08:00:00  37.824668     morning
    9   9           a 2019-01-01 09:00:00  46.502904     morning
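
    As an aside, the same bucketing can be done without apply by using pd.cut; a sketch assuming the same three segments (right=False makes the bins [0, 12), [12, 18), and [18, 24))

    # vectorized alternative to the apply above; astype(str) turns the
    # categorical result back into plain strings
    events_df["time_of_day"] = pd.cut(events_df["timestamp"].dt.hour,
                                      bins=[0, 12, 18, 24],
                                      labels=["morning", "afternoon", "evening"],
                                      right=False).astype(str)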
    

    Next, let's load it into our entityset

    es = ft.EntitySet()
    es.entity_from_dataframe(entity_id="events",
                             index="id",
                             time_index="timestamp",
                             dataframe=events_df)
    
    es.normalize_entity(new_entity_id="customers", index="customer_id", base_entity_id="events")
    
    es.plot()
    

    [Image: entityset diagram from es.plot(), showing the events and customers entities and the relationship between them]
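
    A quick version note: this walkthrough uses the pre-1.0 Featuretools entity API. On featuretools >= 1.0, entities were renamed to dataframes, and the equivalent setup looks roughly like this (a sketch, assuming 1.x)

    # featuretools >= 1.0 spells the same setup with dataframe methods
    es = ft.EntitySet()
    es.add_dataframe(dataframe_name="events",
                     dataframe=events_df,
                     index="id",
                     time_index="timestamp")
    es.normalize_dataframe(base_dataframe_name="events",
                           new_dataframe_name="customers",
                           index="customer_id")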

    Now, we are ready to set the segments we want to create aggregations for by using interesting_values

    es["events"]["time_of_day"].interesting_values = ["morning", "afternoon", "evening"]
    

    Then we can run DFS, placing the aggregation primitives we want computed on a per-segment basis in the where_primitives parameter

    fm, fl = ft.dfs(target_entity="customers",
                    entityset=es,
                    agg_primitives=["count", "mean", "sum"],
                    trans_primitives=[],
                    where_primitives=["count", "mean", "sum"])
    
    fm
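
    The where clause becomes part of each generated feature name, so you can pull just the conditional features out of the returned feature list, e.g.

    # list only the conditional ("WHERE") features that dfs built
    print([f.get_name() for f in fl if "WHERE" in f.get_name()])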
    

    In the resulting feature matrix, you can now see we have separate aggregations for morning, afternoon, and evening

                 COUNT(events)  MEAN(events.amount)  SUM(events.amount)  COUNT(events WHERE time_of_day = afternoon)  COUNT(events WHERE time_of_day = evening)  COUNT(events WHERE time_of_day = morning)  MEAN(events.amount WHERE time_of_day = afternoon)  MEAN(events.amount WHERE time_of_day = evening)  MEAN(events.amount WHERE time_of_day = morning)  SUM(events.amount WHERE time_of_day = afternoon)  SUM(events.amount WHERE time_of_day = evening)  SUM(events.amount WHERE time_of_day = morning)
    customer_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
    a                       37            49.753630         1840.884300                                           12                                          7                                         18                                          35.098923                                        45.861881                                        61.036892                                        421.187073                                      321.033164                                     1098.664063
    b                       30            51.241484         1537.244522                                            3                                         10                                         17                                          45.140800                                        46.170996                                        55.300715                                        135.422399                                      461.709963                                      940.112160
    c                       33            39.563222         1305.586314                                            9                                          7                                         17                                          50.129136                                        34.593936                                        36.015679                                        451.162220                                      242.157549                                      612.266545
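
    As for the second part of the question (counts by day of week, day of month, day of year, and so on), no custom aggregation primitives are needed: derive a column for the calendar segment, mark its interesting values, and run DFS again. A sketch for day of week, reusing the same API as above

    # same trick, but segmenting by day of week
    events_df["day_of_week"] = events_df["timestamp"].dt.day_name()

    es = ft.EntitySet()
    es.entity_from_dataframe(entity_id="events",
                             index="id",
                             time_index="timestamp",
                             dataframe=events_df)
    es.normalize_entity(new_entity_id="customers", index="customer_id",
                        base_entity_id="events")
    es["events"]["day_of_week"].interesting_values = [
        "Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

    fm, fl = ft.dfs(target_entity="customers",
                    entityset=es,
                    agg_primitives=["count"],
                    trans_primitives=[],
                    where_primitives=["count"])

    Day of month and day of year work the same way via dt.day and dt.dayofyear, though with that many distinct values the feature matrix grows wide quickly.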