
In featuretools, how do I create a custom primitive that takes two columns?


I created a custom primitive as shown below.

class Correlate(TransformPrimitive):
    name = 'correlate'
    input_types = [Numeric, Numeric]
    return_type = Numeric
    commutative = True
    compatibility = [Library.PANDAS, Library.DASK, Library.KOALAS]

    def get_function(self):
        def correlate(column1, column2):
            return np.correlate(column1, column2, "same")

        return correlate

Then I checked the calculation directly with NumPy, just in case.

np.correlate(feature_matrix["alcohol"], feature_matrix["chlorides"],mode="same")

However, the result from the custom primitive and the result from the direct np.correlate call were different.

Do you know why they differ?

If my code is fundamentally wrong, please correct me.


Solution

  • Thanks for the question! You can create a custom primitive with a fixed argument to calculate that kind of correlation by using the TransformPrimitive as a base class. I will go through an example using this data.

    import pandas as pd
    
    data = [
        [0.40168819, 0.0857946],
        [0.06268886, 0.27811651],
        [0.16931269, 0.96509497],
        [0.15123022, 0.80546244],
        [0.58610794, 0.56928692],
    ]
    
    df = pd.DataFrame(data=data, columns=list('ab'))
    df.reset_index(inplace=True)
    df
    
    index         a         b
        0  0.401688  0.085795
        1  0.062689  0.278117
        2  0.169313  0.965095
        3  0.151230  0.805462
        4  0.586108  0.569287
    

    With mode='same' and two columns of equal length, np.correlate returns one value per input row, so it behaves as a transform. Define the custom primitive using TransformPrimitive as the base class.

    from featuretools.primitives import TransformPrimitive
    from featuretools.variable_types import Numeric
    import numpy as np
    
    
    class Correlate(TransformPrimitive):
        name = 'correlate'
        input_types = [Numeric, Numeric]
        return_type = Numeric
    
        def get_function(self):
            def correlate(a, b):
                return np.correlate(a, b, mode='same')
    
            return correlate
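
    To illustrate why mode='same' suits a transform, here is a minimal sketch using only NumPy and the five rows from the example data. The variable names are illustrative. It compares the output length of each mode and checks the first 'same' value by hand. The slice full[2:7] is specific to two inputs of length 5.

```python
import numpy as np

# Same five rows as the example DataFrame (columns a and b).
a = np.array([0.40168819, 0.06268886, 0.16931269, 0.15123022, 0.58610794])
b = np.array([0.0857946, 0.27811651, 0.96509497, 0.80546244, 0.56928692])

valid = np.correlate(a, b, mode="valid")  # only the fully overlapping position
same = np.correlate(a, b, mode="same")    # one value per input row
full = np.correlate(a, b, mode="full")    # every overlapping shift

print(len(valid), len(same), len(full))

# For two length-5 inputs, 'same' is the central slice of 'full'.
central = full[2:7]

# The first 'same' value is a sliding dot product with b shifted by two rows.
manual_first = a[0] * b[2] + a[1] * b[3] + a[2] * b[4]
print(np.isclose(same[0], manual_first))
```

    Only 'same' returns exactly as many values as there are rows, which is what a transform primitive needs.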
    

    DFS requires the data to be structured into an EntitySet. Once that is done, you can pass the custom primitive to ft.dfs.

    import featuretools as ft
    
    es = ft.EntitySet()
    
    es.entity_from_dataframe(
        entity_id='data',
        dataframe=df,
        index='index',
    )
    
    fm, fd = ft.dfs(
        entityset=es,
        target_entity='data',
        trans_primitives=[Correlate],
        max_depth=1,
    )
    
    fm[['CORRELATE(a, b)']]
    
           CORRELATE(a, b)
    index                 
    0             0.534548
    1             0.394685
    2             0.670774
    3             0.670506
    4             0.622236
    

    The values in the feature matrix should match the output of np.correlate.

    actual = fm['CORRELATE(a, b)'].values
    expected = np.correlate(df['a'], df['b'], mode='same')
    np.testing.assert_array_equal(actual, expected)
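
    One caveat worth checking, offered as a sketch rather than a diagnosis of the mismatch in the question: np.correlate depends on argument order, so the comparison has to use the same order as the feature name. The example below uses only NumPy and the same five rows, and its variable names are illustrative. For these equal-length inputs, swapping the arguments reverses the output. This suggests that commutative = True, which the question's primitive sets and the version above omits, may not describe this function accurately.

```python
import numpy as np

# Same five rows as the example DataFrame (columns a and b).
a = np.array([0.40168819, 0.06268886, 0.16931269, 0.15123022, 0.58610794])
b = np.array([0.0857946, 0.27811651, 0.96509497, 0.80546244, 0.56928692])

ab = np.correlate(a, b, mode="same")
ba = np.correlate(b, a, mode="same")

# Swapping the inputs changes the result...
order_matters = not np.allclose(ab, ba)
# ...and for these inputs the swapped result is the original reversed.
reversed_match = np.allclose(ab, ba[::-1])

print(order_matters, reversed_match)
```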
    

    You can learn more about defining simple custom primitives and advanced custom primitives in the linked pages. Let me know if you found this helpful.