python, pandas, feature-engineering

Using a string function argument to name a new feature in a pandas DataFrame


I'm trying to write a Python function that will let me add features to a pandas DataFrame for machine learning. I think I'm misunderstanding how strings can be used in Python functions.

The function looks at each row of the df and checks whether the row a given number of months in the future (that many rows below) has the same identifier. If it does, it adds the future row's 'start' value to the new feature column; otherwise it uses the 'end' value of the current row. It's a customized shift function.

Once I have this feature added, I'd like to add a further column of 1s and 0s as a new feature, with an appropriate column label, something like 'feat_so_many_months_in_future_is_higher_or_lower'.

The problem is that I can't even get to the second part, the binary flag around a threshold. I'm having trouble adding the first new feature with the right name.

def binary_up_down(name_of_new_feature, months_in_future, percent_threshold):
    name_of_new_feature = [] 
    for i in range(0, df.shape[0], 1): 
        try:
            if df['identifier'][i]==df['identifier'][i + months_in_future]:
                name_of_new_feature.append(df['start'][i + months_in_future])
            else:
                name_of_new_feature.append(df['end'][i])
        except KeyError:
                name_of_new_feature.append(df['end'][i])

    df[str(name_of_new_feature)]=name_of_new_feature

    ### Add test to check if shifted value is above or below threshold and name new feature appropriately ###

    return df

My thought is to call the function as follows:

binary_up_down('feat_value_in_1m', 1, 5)
#Then
binary_up_down('feat_value_in_3m', 3, 5) # and so on...

When I run the code this line seems to be the problem:

df[str(name_of_new_feature)] = name_of_new_feature

...because it adds all the new feature column values as the column name!

Any pointers much appreciated!


Solution

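  • You're replacing name_of_new_feature with a list in the first line of your function, so the string you passed in is gone by the time you reach the assignment. str(name_of_new_feature) then turns the whole list of values into one long string and uses that as the column label. I would recommend giving the list its own name, something like value_of_new_feature, and using name_of_new_feature directly as the column label: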

    def binary_up_down(name_of_new_feature, months_in_future, percent_threshold):
        # Collect the values in a separate list so the string argument is not overwritten.
        value_of_new_feature = []
        for i in range(df.shape[0]):
            try:
                if df['identifier'][i] == df['identifier'][i + months_in_future]:
                    value_of_new_feature.append(df['start'][i + months_in_future])
                else:
                    value_of_new_feature.append(df['end'][i])
            except KeyError:
                # No row that far ahead: fall back to this row's 'end'.
                value_of_new_feature.append(df['end'][i])

        # name_of_new_feature is still the string passed in, so it works as the column label.
        df[name_of_new_feature] = value_of_new_feature

        ### Add test to check if shifted value is above or below threshold and name new feature appropriately ###

        return df
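
If the loop gets slow on a large df, the same "customized shift" can probably be written without iterating. The following is only a sketch, not the original poster's method: the function name add_future_feature, the "_is_higher" suffix, and the example data are made up for illustration, and it takes df as a parameter instead of relying on a global. It also assumes the default 0..n-1 index, so that "i + months_in_future" means "that many rows below". The threshold rule is a guess, since the question never defines it: the flag is 1 when the new value is at least percent_threshold percent above the row's 'end'.

```python
import pandas as pd


def add_future_feature(df, name_of_new_feature, months_in_future, percent_threshold):
    """Return a copy of df with a shifted-value column and a 0/1 threshold flag."""
    out = df.copy()

    # 'start' from the row months_in_future positions below (NaN past the end).
    future_start = out["start"].shift(-months_in_future)
    # True where that future row has the same identifier as the current row.
    same_id = out["identifier"].eq(out["identifier"].shift(-months_in_future))

    # Keep the future 'start' where identifiers match, else use this row's 'end'.
    out[name_of_new_feature] = future_start.where(same_id, out["end"])

    # ASSUMPTION: "above threshold" means at least percent_threshold percent
    # higher than this row's 'end'. Swap in your own rule if it differs.
    pct_change = (out[name_of_new_feature] - out["end"]) / out["end"] * 100
    flag_name = name_of_new_feature + "_is_higher"  # hypothetical label
    out[flag_name] = (pct_change >= percent_threshold).astype(int)
    return out


# Made-up data: two identifiers, one row per month.
example = pd.DataFrame({
    "identifier": ["a", "a", "a", "b", "b"],
    "start": [100.0, 112.0, 120.0, 50.0, 51.0],
    "end": [105.0, 115.0, 125.0, 52.0, 53.0],
})
result = add_future_feature(example, "feat_value_in_1m", 1, 5)
print(result)
```

Because the column labels are built from the string argument, calling it again with 'feat_value_in_3m' and 3 adds a separate pair of columns. Rows that fall back to their own 'end' always get a flag of 0 under this rule, which may or may not be what you want.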