Tags: python, python-3.x, feature-engineering, featuretools

FeatureTools: Dealing with many-to-many relationships


I have a dataframe of purchases with multiple columns, including the three below:

 PURCHASE_ID (index of purchase)
 WORKER_ID (index of worker)
 ACCOUNT_ID (index of account)

A worker can have multiple accounts associated with them, and an account can have multiple workers.

If I create WORKER and ACCOUNT entities and add the relationships, I get an error:

KeyError: 'Variable: ACCOUNT_ID not found in entity'

Here is my code so far:

import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes

d = {'PURCHASE_ID': [1, 2], 
     'WORKER_ID': [0, 0], 
     'ACCOUNT_ID': [1, 2], 
     'COST': [5, 10], 
     'PURCHASE_TIME': ['2018-01-01 01:00:00', '2016-01-01 02:00:00']}
df = pd.DataFrame(data=d)

data_variable_types = {'PURCHASE_ID': vtypes.Id,
                       'WORKER_ID': vtypes.Id,
                       'ACCOUNT_ID': vtypes.Id,
                       'COST': vtypes.Numeric,
                       'PURCHASE_TIME': vtypes.Datetime}

es = ft.EntitySet('Purchase')
es = es.entity_from_dataframe(entity_id='purchases',
                               dataframe=df,
                               index='PURCHASE_ID',
                               time_index='PURCHASE_TIME',
                               variable_types=data_variable_types)

es.normalize_entity(base_entity_id='purchases',
                   new_entity_id='workers',
                   index='WORKER_ID',
                   additional_variables=['ACCOUNT_ID'],
                   make_time_index=False)

es.normalize_entity(base_entity_id='purchases',
                   new_entity_id='accounts',
                   index='ACCOUNT_ID',
                   additional_variables=['WORKER_ID'],
                   make_time_index=False)

fm, features = ft.dfs(entityset=es,
                     target_entity='purchases',
                     agg_primitives=['mean'],
                     trans_primitives=[],
                     verbose=True)
features

How do I separate the entities to include many-to-many relationships?


Solution

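  • Your approach is correct; however, you don't need to use the additional_variables argument. If you omit it, your code will run without issues. The error occurs because additional_variables moves the listed variables out of the base entity: after the first normalize_entity call, ACCOUNT_ID lives in workers and is no longer in purchases, so the second call cannot find it.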

    The purpose of the additional_variables argument to EntitySet.normalize_entity is to include other variables you want in the new parent entity you are creating. For example, say you had variables for a worker's hire date, salary, location, etc. You would pass those as additional variables because they are static with respect to a worker. In this case, I don't think you have any variables like that.
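
    To make the effect of additional_variables concrete, here is a toy model in plain Python. It is not the Featuretools implementation: the normalize function and the set-of-column-names representation are hypothetical, and the only behavior it assumes is that additional variables are moved out of the base entity into the new parent. Under that assumption it reproduces the KeyError from the question.

```python
# Toy model only: `normalize` is a hypothetical stand-in, NOT Featuretools code.
# Each entity is represented as a set of column names.

def normalize(entities, base, new, index, additional=()):
    """Create parent entity `new` keyed on `index`, moving `additional` columns out of `base`."""
    base_cols = entities[base]
    for col in (index, *additional):
        if col not in base_cols:
            raise KeyError(f"Variable: {col} not found in entity")
    # The index stays in the child as the foreign key; additional
    # columns are moved to the parent and removed from the child.
    entities[new] = {index, *additional}
    base_cols.difference_update(additional)


cols = {"PURCHASE_ID", "WORKER_ID", "ACCOUNT_ID", "COST", "PURCHASE_TIME"}

# With additional variables: the first call moves ACCOUNT_ID into workers,
# so the second call can no longer find it in purchases.
entities = {"purchases": set(cols)}
normalize(entities, "purchases", "workers", "WORKER_ID", additional=["ACCOUNT_ID"])
try:
    normalize(entities, "purchases", "accounts", "ACCOUNT_ID", additional=["WORKER_ID"])
    error = None
except KeyError as exc:
    error = exc.args[0]

# Without additional variables: both foreign keys stay in purchases.
entities_ok = {"purchases": set(cols)}
normalize(entities_ok, "purchases", "workers", "WORKER_ID")
normalize(entities_ok, "purchases", "accounts", "ACCOUNT_ID")
```

    In the second run, purchases keeps both WORKER_ID and ACCOUNT_ID, which is what lets it act as the link between workers and accounts in the many-to-many relationship.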

    Here is the code with additional_variables removed, and the output I see:

    import pandas as pd
    import featuretools as ft
    import featuretools.variable_types as vtypes
    
    d = {'PURCHASE_ID': [1, 2], 
         'WORKER_ID': [0, 0], 
         'ACCOUNT_ID': [1, 2], 
         'COST': [5, 10], 
         'PURCHASE_TIME': ['2018-01-01 01:00:00', '2016-01-01 02:00:00']}
    df = pd.DataFrame(data=d)
    
    data_variable_types = {'PURCHASE_ID': vtypes.Id,
                           'WORKER_ID': vtypes.Id,
                           'ACCOUNT_ID': vtypes.Id,
                           'COST': vtypes.Numeric,
                           'PURCHASE_TIME': vtypes.Datetime}
    
    es = ft.EntitySet('Purchase')
    es = es.entity_from_dataframe(entity_id='purchases',
                                   dataframe=df,
                                   index='PURCHASE_ID',
                                   time_index='PURCHASE_TIME',
                                   variable_types=data_variable_types)
    
    es.normalize_entity(base_entity_id='purchases',
                       new_entity_id='workers',
                       index='WORKER_ID',
                       make_time_index=False)
    
    es.normalize_entity(base_entity_id='purchases',
                       new_entity_id='accounts',
                       index='ACCOUNT_ID',
                       make_time_index=False)
    
    fm, features = ft.dfs(entityset=es,
                         target_entity='purchases',
                         agg_primitives=['mean'],
                         trans_primitives=[],
                         verbose=True)
    features
    

    This outputs:

    [<Feature: WORKER_ID>,
     <Feature: ACCOUNT_ID>,
     <Feature: COST>,
     <Feature: workers.MEAN(purchases.COST)>,
     <Feature: accounts.MEAN(purchases.COST)>]
    

    If we change the target entity and increase the depth:

    fm, features = ft.dfs(entityset=es,
                         target_entity='workers',
                         agg_primitives=['mean', 'count'],
                         max_depth=3,
                         trans_primitives=[],
                         verbose=True)
    features
    

    The output is now a list of features for the workers entity:

    [<Feature: COUNT(purchases)>,
     <Feature: MEAN(purchases.COST)>,
     <Feature: MEAN(purchases.accounts.MEAN(purchases.COST))>,
     <Feature: MEAN(purchases.accounts.COUNT(purchases))>]
    

    Let's explain the feature named MEAN(purchases.accounts.COUNT(purchases)):

    1. For a given worker, find each of the purchases related to that worker.
    2. For each of those purchases, count the total number of purchases made by the account involved in that particular purchase.
    3. Average this count across all of the given worker's purchases.

    In other words, "what is the average number of purchases made by accounts related to purchases made by this worker?"
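
    The three steps above can be checked by hand in plain Python. This is a minimal sketch on a small hypothetical dataset (the four purchase rows are made up for illustration), and it ignores time indexes and cutoff times, so it only illustrates the arithmetic rather than how Featuretools computes the feature internally.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical purchases: (purchase_id, worker_id, account_id)
purchases = [
    (1, 0, 1),
    (2, 0, 2),
    (3, 1, 2),
    (4, 1, 2),
]

# accounts.COUNT(purchases): number of purchases made by each account.
purchases_per_account = Counter(account for _, _, account in purchases)

# Steps 1 and 2: for each worker, look at each of their purchases and
# record the purchase count of the account involved in that purchase.
counts_by_worker = defaultdict(list)
for _, worker, account in purchases:
    counts_by_worker[worker].append(purchases_per_account[account])

# Step 3: average those counts across each worker's purchases.
feature = {worker: mean(counts) for worker, counts in counts_by_worker.items()}
```

    Here account 1 has one purchase and account 2 has three. Worker 0 purchased once through each account, so the feature is (1 + 3) / 2 = 2; worker 1 purchased twice through account 2, so the feature is (3 + 3) / 2 = 3.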