python, feature-engineering, featuretools, entityset

Featuretools - unable to add relationship in Entityset


I'm writing a notebook using this data from Kaggle. Here's a screenshot of the two tables, just to show that both have ID columns.

Here's my code when trying to set up the Entity Set and add a relationship.

import featuretools as ft 
import pandas as pd

es = ft.EntitySet()
es = es.add_dataframe(dataframe=train_sampled, index='new_index', dataframe_name='application', make_index=True)
es = es.add_dataframe(dataframe=bureau, index='new_index', dataframe_name='bureau', make_index=True)

new_relationship = ft.Relationship(entityset=es,parent_dataframe_name='application',parent_column_name='SK_ID_CURR',
                    child_dataframe_name='bureau',child_column_name='SK_ID_CURR')
es = es.add_relationship(new_relationship)

And here's the error I'm getting, which doesn't make any sense to me.

KeyError: 'DataFrame <Relationship: bureau.SK_ID_CURR -> application.SK_ID_CURR> does not exist in entity set'

The EntitySet exists, but I can't add a relationship to it, which is the whole point of this.

Any advice or guidance is much appreciated.

EDIT (solution): This code applies the answer below and also sets the index of the bureau table to SK_ID_BUREAU, the column that is unique in that table.

es = ft.EntitySet()
es = es.add_dataframe(dataframe=train_sampled, index='SK_ID_CURR', dataframe_name='application', make_index=False)
es = es.add_dataframe(dataframe=bureau, index='SK_ID_BUREAU', dataframe_name='bureau', make_index=False)

new_relationship = ft.Relationship(entityset=es,parent_dataframe_name='application',parent_column_name='SK_ID_CURR',
                    child_dataframe_name='bureau',child_column_name='SK_ID_CURR')
es = es.add_relationship(relationship=new_relationship)
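The index change matters because an index column needs one distinct value per row, and in bureau one applicant (SK_ID_CURR) can appear on several rows. Here is a minimal sketch of that check using plain Python lists. The sample rows are made up for illustration, and `is_valid_index` is a hypothetical helper, not part of Featuretools. On a real DataFrame, pandas' `Series.is_unique` answers the same question.

```python
from collections import Counter


def is_valid_index(values):
    """Return True if no value is missing and every value appears exactly once."""
    if any(v is None for v in values):
        return False
    return all(count == 1 for count in Counter(values).values())


# Made-up rows standing in for the bureau table: one applicant
# (SK_ID_CURR) can have several bureau records (SK_ID_BUREAU).
bureau_rows = [
    {"SK_ID_BUREAU": 1, "SK_ID_CURR": 100},
    {"SK_ID_BUREAU": 2, "SK_ID_CURR": 100},
    {"SK_ID_BUREAU": 3, "SK_ID_CURR": 200},
]

# SK_ID_CURR repeats, so it cannot index bureau; SK_ID_BUREAU can.
curr_ok = is_valid_index([row["SK_ID_CURR"] for row in bureau_rows])
bureau_ok = is_valid_index([row["SK_ID_BUREAU"] for row in bureau_rows])
print(curr_ok, bureau_ok)
```

SK_ID_CURR still serves as the child column of the relationship; it just cannot be the index of bureau.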

Solution

  • If you add a relationship to an EntitySet by passing in a Relationship object, you need to use the relationship keyword in your call, like this:

    es.add_relationship(relationship=new_relationship)
    

    Without the relationship keyword, the method expects four values: parent_dataframe_name, parent_column_name, child_dataframe_name, and child_column_name. Your Relationship object was therefore treated as the parent dataframe name, which is why the error says that DataFrame does not exist in the entity set. With this form you can skip creating the Relationship object altogether and add the relationship like this:

    es.add_relationship('application', 'SK_ID_CURR', 'bureau', 'SK_ID_CURR')
    

    Finally, you can also use the EntitySet.add_relationships method to add your relationship, which allows you to add one or more relationships to an EntitySet by passing in a list of Relationship objects:

    es.add_relationships([new_relationship])
    

    For more details on all of these methods and their expected arguments, see the Featuretools API Reference.
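To see how the error message in the question arises, it may help to look at how positional arguments bind. The sketch below is a toy stand-in, not Featuretools' actual implementation: `ToyEntitySet` and `ToyRelationship` are hypothetical classes, and the parameter order is assumed from the four values described above. It shows a relationship object passed positionally landing in `parent_dataframe_name`, where it is looked up as if it were a dataframe name.

```python
class ToyRelationship:
    """Hypothetical stand-in for a relationship object."""

    def __init__(self, parent_df, parent_col, child_df, child_col):
        self.parent_df, self.parent_col = parent_df, parent_col
        self.child_df, self.child_col = child_df, child_col

    def __repr__(self):
        return (f"<Relationship: {self.child_df}.{self.child_col} -> "
                f"{self.parent_df}.{self.parent_col}>")


class ToyEntitySet:
    """Hypothetical stand-in that only tracks dataframe names."""

    def __init__(self, dataframe_names):
        self.dataframe_names = set(dataframe_names)
        self.relationships = []

    def add_relationship(self, parent_dataframe_name=None, parent_column_name=None,
                         child_dataframe_name=None, child_column_name=None,
                         relationship=None):
        if relationship is None:
            # No relationship keyword given: treat the arguments as names.
            for name in (parent_dataframe_name, child_dataframe_name):
                if name not in self.dataframe_names:
                    raise KeyError(f"DataFrame {name} does not exist in entity set")
            relationship = ToyRelationship(parent_dataframe_name, parent_column_name,
                                           child_dataframe_name, child_column_name)
        self.relationships.append(relationship)
        return self


es = ToyEntitySet(["application", "bureau"])
rel = ToyRelationship("application", "SK_ID_CURR", "bureau", "SK_ID_CURR")

error_message = None
try:
    es.add_relationship(rel)  # positional: rel binds to parent_dataframe_name
except KeyError as exc:
    error_message = exc.args[0]
print(error_message)

es.add_relationship(relationship=rel)  # keyword: rel is used as the relationship
es.add_relationship("application", "SK_ID_CURR", "bureau", "SK_ID_CURR")  # four names
```

In this toy version, the positional call fails with the same wording as the error in the question, while the keyword call and the four-name call both succeed.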