python pandas lambda hash calculated-columns

Adding calculated column to dataframe causes error using lambda function

I am trying to add a new calculated column to a dataframe based on a function that does some math. The function uses values from c1 and c2 of my dataframe as inputs as well as some predefined constant variables.

As part of the function, the values of c2 are used to look up a value in a dictionary using a lambda. This throws "TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed".

There are no null or strange values in my dataframe.

The function call looks something like this:

df['new column'] = some_function(df['c1'], var1, var2,... df['c2'])

The part of "some_function" that fails looks like this:

  value = some_dict.get(df['c2']) or some_dict[min(some_dict.keys(),
        key = lambda key: abs(key-df['c2']))]

If I replace df['c2'] with a constant, the code runs as expected.

If I use df['c2'].mean() I get "TypeError: 'Series' objects are mutable, thus they cannot be hashed"

print(df.info())

Returns:

<class 'pandas.core.frame.DataFrame'>
Index: 729 entries, 2019-05-08 00:00:00.000 to 2021-05-05 00:00:00.000
Data columns (total 2 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   (c1)                                               729 non-null    float64
 1   (c2)                                               729 non-null    float64
dtypes: float64(2)
memory usage: 17.1+ KB
None

c1 and c2 don't seem to differ. I tried swapping them in the function call and also using c1 as the input in both places.

type(df['c1'])
Out[178]: pandas.core.frame.DataFrame

type(df['c2'])
Out[179]: pandas.core.frame.DataFrame

Any ideas how I can fix this? Should I define a lookup function instead of using a lambda?

Solution

You can try to do it this way:

df['new column'] = df.apply(lambda x: some_function(x['c1'], var1, var2,... x['c2']), axis=1)

As mentioned in the comment, you cannot use a whole pandas Series or DataFrame as a dictionary key: a dict lookup hashes its key, and these objects are unhashable. The lookup has to be done element-wise. Plain dict methods and custom Python functions like this one are also not vectorized the way NumPy and pandas functions are.
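To illustrate the point, here is a minimal sketch of the failure and of the element-wise alternative. The dictionary contents and the two-value Series are made up for the example; the exact wording of the TypeError may differ between pandas versions.

```python
import pandas as pd

some_dict = {1.0: "a", 2.0: "b"}   # hypothetical lookup table
col = pd.Series([1.0, 2.0])        # stands in for df['c2']

error_message = None
try:
    some_dict.get(col)             # the dict has to hash its key first
except TypeError as exc:
    error_message = str(exc)       # Series is unhashable, so this branch runs

# Looking up one scalar at a time works as usual
looked_up = [some_dict.get(v) for v in col]
```

The same TypeError is raised for a DataFrame, which is what the question's df['c2'] turns out to be.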

With .apply() as above, each row's values are passed to the custom function some_function() one at a time, instead of the whole DataFrame or Series being passed as arguments.

In particular, you want to pass the values of df['c2'] to some_dict.get(), but a Python dict cannot take a whole pandas Series (i.e. a column) as a key. Calling .apply() with axis=1 bridges this gap by handing the function one row at a time, so each lookup receives a single scalar.
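A small sketch of that row-by-row behaviour, with made-up data. The names `rates` and `scale` and the constant 2.0 are placeholders, not taken from the question, and the sketch assumes ordinary single-level column names so that row["c1"] is a scalar rather than a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"c1": [1.0, 2.0, 3.0], "c2": [10.0, 20.0, 30.0]})
rates = {10.0: 0.1, 20.0: 0.2, 30.0: 0.3}   # hypothetical lookup table

def scale(c1_val, factor, c2_val):
    # c1_val and c2_val arrive as single floats, one row at a time,
    # so c2_val is hashable and can be used as a dict key
    return c1_val * factor * rates[c2_val]

# axis=1 hands each row to the lambda; row["c1"] and row["c2"] are scalars
df["new column"] = df.apply(lambda row: scale(row["c1"], 2.0, row["c2"]), axis=1)
```

Note that this runs a Python-level call per row, so it is a convenience rather than a vectorized operation.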

You can then define some_function() like an ordinary function that accepts only scalar values (not vector objects such as a pandas DataFrame or Series), e.g.:

def some_function(c1_val, var1, var2,... c2_val):
    ...
    value = some_dict.get(c2_val) or some_dict[min(some_dict.keys(),
        key = lambda key: abs(key - c2_val))]
    ....
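For reference, a complete runnable version might look like the sketch below. The dictionary contents, the data, the helper name `nearest_value` and the arithmetic inside some_function() are all placeholders, since the real formula is not shown in the question. The sketch also swaps the `get(...) or ...` idiom for an explicit membership test, so that a stored value of 0 is not mistaken for a missing key.

```python
import pandas as pd

some_dict = {10.0: 1.5, 20.0: 2.5, 30.0: 3.5}   # hypothetical lookup table

def nearest_value(c2_val):
    """Return the entry for c2_val, or for the closest key if it is absent."""
    if c2_val in some_dict:
        return some_dict[c2_val]
    nearest_key = min(some_dict, key=lambda key: abs(key - c2_val))
    return some_dict[nearest_key]

def some_function(c1_val, var1, c2_val):
    # Placeholder math; only the scalar lookup matters for this example
    return c1_val * var1 + nearest_value(c2_val)

df = pd.DataFrame({"c1": [1.0, 2.0], "c2": [10.0, 26.0]})
df["new column"] = df.apply(lambda x: some_function(x["c1"], 2.0, x["c2"]), axis=1)
```

Here 10.0 is found directly, while 26.0 is not a key and falls back to the nearest one, 30.0.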