I am using featuretools (1.1x version), I read the docs,and also searched here
but still struggle to find how to do simple things like SELECT MIN(datetime_field_1)..
I also checked list_primitives() those related to time seem not what I need,
I can do this for numeric fields, but seems can't do it on Datetime fields..
https://featuretools.alteryx.com/en/stable/
I simply want to get min(timestamp), max(timestamp) group by customer_id, but max/min primitive only works on numeric
import featuretools as ft
import pandas as pd
import numpy as np
# make some random data
n = 100
events_df = pd.DataFrame({
"id" : range(n),
"customer_id": np.random.choice(["a", "b", "c"], n),
"timestamp": pd.date_range("Jan 1, 2019", freq="1h", periods=n),
"amount": np.random.rand(n) * 100
})
def to_part_of_day(x):
if x < 12:
return "morning"
elif x < 18:
return "afternoon"
else:
return "evening"
es = ft.EntitySet(id='my_set')
es = es.add_dataframe(dataframe = events_df, dataframe_name = 'events', time_index='timestamp', index='index')
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name='events',
agg_primitives=['min','max'],
trans_primitive=[],
primitive_options={
'max':{
"include_groupby_columns":{"events":["customer_id"]}
}
}
)
How should I get max(amount), max(timestamp) by each customer_id? Thanks! Feels silly to ask such basic thing after reading featuretools.alteryx.com and their github examples..
I think you have a few issues here. First of all, the Max
and Min
primitives only operate on numeric columns, as you mention. If your data is sorted based on the datetime value, you could use the First
and Last
aggregation primitives to get the first and last values, respectively, which will correspond to Min
and Max
if sorted in ascending order. If these primitives aren't sufficient, you will need to define your own custom primitive.
Second, by default, Featuretools will not return datetime values as features. In order to get those values returned, you will need to change the return_types
in your call to DFS to include datetime values.
Finally, your example above only uses a single dataframe in the EntitySet
. Featuretools aggregations are only applied when multiple dataframes are present in an EntitySet. Aggregations are performed across the defined relationship. To define features for a customer you would target the customer table and then aggregate values from other tables (like purchases for example) to get features like MAX(purchases.amount)
or FIRST(purchases.date)
.
Here is a complete example building off your starting data:
import featuretools as ft
import pandas as pd
import numpy as np
# make some random data
n = 100
purchases_df = pd.DataFrame({
"id" : range(n),
"customer_id": np.random.choice(["a", "b", "c"], n),
"date": pd.date_range("Jan 1, 2019", freq="1h", periods=n),
"amount": np.random.rand(n) * 100
})
customers_df = pd.DataFrame({
"id": ["a", "b", "c"],
})
es = ft.EntitySet()
es = es.add_dataframe(dataframe=purchases_df, dataframe_name="purchases", time_index="date", index="id")
es = es.add_dataframe(dataframe=customers_df, dataframe_name="customers", index="id")
es.add_relationship("customers", "id", "purchases", "customer_id")
feature_matrix, feature_defs = ft.dfs(
entityset=es,
target_dataframe_name='customers',
agg_primitives=["min", "max", "first", "last"],
trans_primitives=["month"],
return_types="all")
FIRST(purchases.amount) FIRST(purchases.date) FIRST(purchases.id) LAST(purchases.amount) LAST(purchases.date) ... MIN(purchases.amount) FIRST(purchases.MONTH(date)) LAST(purchases.MONTH(date)) MONTH(FIRST(purchases.date)) MONTH(LAST(purchases.date))
id ...
a 10.858168 2019-01-01 01:00:00 1 77.302337 2019-01-05 03:00:00 ... 2.639971 1 1 1 1
b 43.035418 2019-01-01 03:00:00 3 63.446195 2019-01-04 23:00:00 ... 3.790079 1 1 1 1
c 29.234700 2019-01-01 00:00:00 0 78.434346 2019-01-05 01:00:00 ... 1.088463 1 1 1 1
If you want to return only certain column types you can pass a list of types to return_types
instead of "all"
. Also, if you only want First
and Last
to apply to the datetime column you can do that by passing appropriate values to primitive_options
in the call to DFS. The documentation contains information on how to do that.