Search code examples
feature-extractionfeature-engineeringfeaturetools

featuretools basic aggegration on time measures


I am using featuretools (1.1x version), I read the docs,and also searched here

but still struggle to find how to do simple things like SELECT MIN(datetime_field_1)..

I also checked list_primitives() those related to time seem not what I need,

I can do this for numeric fields, but seems can't do it on Datetime fields..

https://featuretools.alteryx.com/en/stable/

I simply want to get min(timestamp), max(timestamp) group by customer_id, but max/min primitive only works on numeric

import featuretools as ft
import pandas as pd
import numpy as np

# make some random data
n = 100
events_df = pd.DataFrame({
    "id" : range(n),
    "customer_id": np.random.choice(["a", "b", "c"], n),
    "timestamp": pd.date_range("Jan 1, 2019", freq="1h", periods=n),
    "amount": np.random.rand(n) * 100 
})

def to_part_of_day(x):
    if x < 12:
        return "morning"
    elif x < 18:
        return "afternoon"
    else:
        return "evening"
es = ft.EntitySet(id='my_set')
es = es.add_dataframe(dataframe = events_df, dataframe_name = 'events', time_index='timestamp', index='index')
feature_matrix, feature_defs = ft.dfs(
  entityset=es,
  target_dataframe_name='events',
  agg_primitives=['min','max'],
  trans_primitive=[],
  primitive_options={
  'max':{
        "include_groupby_columns":{"events":["customer_id"]}
        }
  }

)


How should I get max(amount), max(timestamp) by each customer_id? Thanks! Feels silly to ask such basic thing after reading featuretools.alteryx.com and their github examples..


Solution

  • I think you have a few issues here. First of all, the Max and Min primitives only operate on numeric columns, as you mention. If your data is sorted based on the datetime value, you could use the First and Last aggregation primitives to get the first and last values, respectively, which will correspond to Min and Max if sorted in ascending order. If these primitives aren't sufficient, you will need to define your own custom primitive.

    Second, by default, Featuretools will not return datetime values as features. In order to get those values returned, you will need to change the return_types in your call to DFS to include datetime values.

    Finally, your example above only uses a single dataframe in the EntitySet. Featuretools aggregations are only applied when multiple dataframes are present in an EntitySet. Aggregations are performed across the defined relationship. To define features for a customer you would target the customer table and then aggregate values from other tables (like purchases for example) to get features like MAX(purchases.amount) or FIRST(purchases.date).

    Here is a complete example building off your starting data:

    import featuretools as ft
    import pandas as pd
    import numpy as np
    
    # make some random data
    n = 100
    purchases_df = pd.DataFrame({
        "id" : range(n),
        "customer_id": np.random.choice(["a", "b", "c"], n),
        "date": pd.date_range("Jan 1, 2019", freq="1h", periods=n),
        "amount": np.random.rand(n) * 100 
    })
    
    customers_df = pd.DataFrame({
        "id": ["a", "b", "c"],
    })
    
    es = ft.EntitySet()
    es = es.add_dataframe(dataframe=purchases_df, dataframe_name="purchases", time_index="date", index="id")
    es = es.add_dataframe(dataframe=customers_df, dataframe_name="customers", index="id")
    es.add_relationship("customers", "id", "purchases", "customer_id")
    
    feature_matrix, feature_defs = ft.dfs(
      entityset=es,
      target_dataframe_name='customers',
      agg_primitives=["min", "max", "first", "last"],
      trans_primitives=["month"],
      return_types="all")
    
        FIRST(purchases.amount) FIRST(purchases.date)  FIRST(purchases.id)  LAST(purchases.amount) LAST(purchases.date)  ...  MIN(purchases.amount)  FIRST(purchases.MONTH(date))  LAST(purchases.MONTH(date)) MONTH(FIRST(purchases.date)) MONTH(LAST(purchases.date))
    id                                                                                                                   ...
    a                 10.858168   2019-01-01 01:00:00                    1               77.302337  2019-01-05 03:00:00  ...               2.639971                             1                            1                            1                           1
    b                 43.035418   2019-01-01 03:00:00                    3               63.446195  2019-01-04 23:00:00  ...               3.790079                             1                            1                            1                           1
    c                 29.234700   2019-01-01 00:00:00                    0               78.434346  2019-01-05 01:00:00  ...               1.088463                             1                            1                            1                           1
    

    If you want to return only certain column types you can pass a list of types to return_types instead of "all". Also, if you only want First and Last to apply to the datetime column you can do that by passing appropriate values to primitive_options in the call to DFS. The documentation contains information on how to do that.