Search code examples
pythonpandasgroup-byfeaturetoolsfeature-engineering

How does df.groupby('A').agg('min') translate to featuretools?


Say I have this simple snippet of code. I will group, aggregate, and merge the dataframe:


Using Pandas:


Data

df = pd.DataFrame({'A': [1, 1, 2, 2],
               'B': [1, 2, 3, 4],
               'C': [0.3, 0.2, 1.2, -0.5]})
df:
    A   B   C
0   1   1   0.3
1   1   2   0.2
2   2   3   1.2
3   2   4   -0.5

Group and Aggregate

df_result = df.groupby('A').agg('min')
df_result.columns =  ['groupby_A(min_'+x+')' for x in df_result.columns]
df_result:
    groupby_A(min_B)    groupby_A(min_C)
A       
1   1                   0.2
2   3                   -0.5

Merge

df_new = pd.merge(df,df_result,on='A')
df_new
df_new:
    A   B   C       groupby_A(min_B)    groupby_A(min_C)
0   1   1   0.3     1                   0.2
1   1   2   0.2     1                   0.2
2   2   3   1.2     3                  -0.5
3   2   4   -0.5    3                  -0.5

An Attempt using featuretools:


# ---- Import the Module ----
import featuretools as ft

# ---- Make the Entity Set (the set of all tables) ----
es = ft.EntitySet()

# ---- Make the Entity (the table) ----
es.entity_from_dataframe(entity_id = 'df', 
                         dataframe = df)


# ---- Do the Deep Feature Synthesis (group, aggregate, and merge the features) ----
feature_matrix, feature_names = ft.dfs(entityset = es, 
                                       target_entity = 'df',
                                       trans_primitives = ['cum_min'])

feature_matrix
feature_matrix:
        A       B       C       CUM_MIN(A)  CUM_MIN(B)  CUM_MIN(C)
index                       
0       1       1       0.3     1           1           0.3
1       1       2       0.2     1           1           0.2
2       2       3       1.2     1           1           0.2
3       2       4       -0.5    1           1           -0.5

How does the operation with Pandas translate into featuretools (preferably without adding another table)?

My attempt with featuretools does not give the right output, but I believe the process that I used is somewhat correct.


Solution

  • Here is the recommended way to do it in Featuretools. You do need to create another table to make it work exactly as you want.

    import featuretools as ft
    import pandas as pd
    
    df = pd.DataFrame({'A': [1, 1, 2, 2],
                       'B': [1, 2, 3, 4],
                       'C': [0.3, 0.2, 1.2, -0.5]})
    
    es = ft.EntitySet()
    
    es.entity_from_dataframe(entity_id="example",
                              index="id",
                              make_index=True,
                              dataframe=df)
    
    es.normalize_entity(new_entity_id="a_entity",
                        base_entity_id="example",
                        index="A")
    
    fm, fl = ft.dfs(target_entity="example",
                    entityset=es,
                    agg_primitives=["min"])
    
    fm
    

    this returns

        A  B    C  a_entity.MIN(example.B)  a_entity.MIN(example.C)
    id                                                             
    0   1  1  0.3                        1                      0.2
    1   1  2  0.2                        1                      0.2
    2   2  3  1.2                        3                     -0.5
    3   2  4 -0.5                        3                     -0.5
    

    If you don't want to create an extra table you could try using the cum_min primitive which calculate the cumulative after grouping by A

    df = pd.DataFrame({'A': [1, 1, 2, 2],
                       'B': [1, 2, 3, 4],
                       'C': [0.3, 0.2, 1.2, -0.5]})
    
    es = ft.EntitySet()
    
    es.entity_from_dataframe(entity_id="example",
                              index="id",
                              make_index=True,
                              variable_types={
                                  "A": ft.variable_types.Id
                              },
                              dataframe=df,)
    
    fm, fl = ft.dfs(target_entity="example",
                    entityset=es,
                    groupby_trans_primitives=["cum_min"])
    
    fm
    

    this returns

        B    C  A  CUM_MIN(C) by A  CUM_MIN(B) by A
    id                                             
    0   1  0.3  1              0.3              1.0
    1   2  0.2  1              0.2              1.0
    2   3  1.2  2              1.2              3.0
    3   4 -0.5  2             -0.5              3.0