
Using multiple Ids in featuretools


I have a dataset on which I would like to run automated feature engineering. However, it is time-series based, so to make it work I need to use two columns as ids: the object id and the date.

x = pd.DataFrame({'id': [1,2,1], 'date': [2012021,2032021,4052021], 'x1': [1,2,3]})
y = pd.DataFrame({'id': [1,2,1], 'date': [2012021,2032021,4052021], 'label': [3,2,1]})
entities = {"features": (x, ['id','date']), "labels": (y, ['id','date'])}
feature_matrix, features_defs = ft.dfs(entities=entities,target_entity="y")

When I run this I get this error:

TypeError: unhashable type: 'list'

How do I fix this?


Solution

  • You are right that both columns matter, but the index of an entity has to be a single column, not a list. Create one unique index for the entity set, then bring `id` in through normalization. I would recommend doing it this way:

    1. Create a single dataframe instead of two
    import pandas as pd
    import featuretools as ft  # the calls below use the pre-1.0 featuretools API

    data = pd.DataFrame({'id': [1,2,1], 'date': [2012021,2032021,4052021], 'x1': [1,2,3], 'label': [3,2,1]})
    
    2. Add a unique index column
    data['index'] = data.index
    
    3. Create an entity set
    es = ft.EntitySet('My EntitySet')
    
    4. Create an entity from the dataframe, with one unique index instead of two id columns
    es.entity_from_dataframe(
        entity_id='main_data',
        dataframe=data,
        index='index',
        time_index='date'
    )
    
    5. Normalize it, so that each unique id gets its own row in a new entity
    es.normalize_entity(
        base_entity_id='main_data',
        new_entity_id='observations',
        index='id',
        make_time_index=True
    )
    
    6. Create features (remember to set options such as the aggregation primitives if you do not want the defaults)
    feature_matrix, features_defs = ft.dfs(entityset=es, target_entity="main_data")
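
    As a variation on the step that adds a unique index column: if you would rather have a key derived from the data than the positional row number, the two identifying columns can be combined into one. This is only a minimal sketch using plain pandas; the column name `obs_key` is a hypothetical choice for illustration and has no special meaning to featuretools.

```python
import pandas as pd

data = pd.DataFrame({
    'id': [1, 2, 1],
    'date': [2012021, 2032021, 4052021],
    'x1': [1, 2, 3],
    'label': [3, 2, 1],
})

# Combine the two identifying columns into one string key.
# 'obs_key' is a hypothetical column name chosen for this sketch.
data['obs_key'] = data['id'].astype(str) + '_' + data['date'].astype(str)

# An index column must identify each row uniquely, so check before using it.
assert data['obs_key'].is_unique

print(data[['obs_key', 'id', 'date']])
```

    A column like this could then be passed as the single index in place of `'index'`, provided every (id, date) pair occurs only once in the data.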
    

    There might be another, or even better, way to deal with this; see this GitHub question or this SO answer.
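
    For context on the original error message: Python lists cannot be hashed, so any code that hashes its index argument (for example, by using it as a dictionary key) fails when it receives `['id','date']`. Whether featuretools does exactly this internally is an assumption here; the sketch below only reproduces the same message in plain Python, and the variable names are illustrative.

```python
index_arg = ['id', 'date']  # the list passed as the index in the question

try:
    hash(index_arg)  # lists are mutable, so Python refuses to hash them
    message = None
except TypeError as err:
    message = str(err)

print(message)

# A single column name is a string, which hashes without error.
single_index_ok = isinstance(hash('index'), int)
```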