Why does lambda work on new columns generated using pandas.Dataframe.assign in Python?


I often use pandas.DataFrame.assign() in order to method chain in Python.

When calculating values from existing columns, I never have to use lambda. But if I want to create a calculated column from a column I created within the same assign statement, I have to use lambda x. The code below works, but I do not understand why the lambda is what makes it work.

Let's say I have an existing DataFrame with columns A, B, C. Using an assign statement, I want to change A by multiplying A and B. I also create a new column D by multiplying B and C. Then I want to multiply C and D, which only works using lambda. Why does lambda know that I created column D, but the normal df['D'] * df['C'] does not?

A B C
One Two Three
df = (df
      .assign(A = df['A'] * df['B'],
              D = df['B'] * df['C'],
              E = lambda x: x['D'] * x['C']))

Solution

  • Assigning multiple columns within the same assign is possible. Later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order.

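    First, it comes down to the order of execution.

    In .assign(A = df['A'] * df['B'], ...), the expression df['A'] * df['B'] is evaluated before df.assign executes, because Python evaluates a call's arguments before it makes the call. At that point df still refers to the original frame, so a column created inside the same assign (such as D) does not exist in it yet.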

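    from datetime import datetime
    import pandas as pd

    df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
    assign = df.assign  # keep a reference to the real method

    def new_value():
        # Helper reconstructed from the output below; the return value is arbitrary
        print("Hello from: new_value()")
        print(datetime.now())
        return 1

    def debug_assign(**kwargs):
        print("Hello from: assign()")
        print(datetime.now())
        assign(**kwargs)  # result discarded; only the call order matters here

    df.assign = debug_assign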
    
    >>> df.assign(D = new_value())
    Hello from: new_value()
    2023-02-14 16:08:38.424683
    Hello from: assign()
    2023-02-14 16:08:38.424722
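
The same ordering explains the failure described in the question. A minimal sketch, reusing the one-row frame from above and a hypothetical new column named E: the expression df["D"] is looked up on the original frame while the arguments are being built, so it raises before assign is ever called.

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})

error = None
try:
    # Both arguments are evaluated first, against the ORIGINAL df.
    # df["D"] does not exist there, so this fails before assign() runs.
    df.assign(D=df["B"] * df["C"], E=df["D"] * df["C"])
except KeyError as exc:
    error = exc
    print("KeyError raised for column:", exc)
```

The D keyword never gets a chance to create the column, because the failure happens while Python is still evaluating the argument list.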
    

    As for a lambda: it is like a "mini-function". Declaring a lambda is like defining a function; nothing inside it is executed until it is called.

    >>> lambda x: x['D'] * x['C']
    <function __main__.<lambda>(x)>
    

    Meaning:

    >>> df.assign(D = lambda x: x['D'] * x['C'])
    

    Is similar to doing:

    >>> def callback(x): return x['D'] * x['C']
    >>> df.assign(D = callback)
    

    Functions can be assigned to variables and passed as arguments.

    >>> my_other_print = print
    >>> my_other_print
    <function print>
    

    They are not executed (called) until () is used. Notice there is no () in D = callback.

    >>> my_other_print("hello")
    hello
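
A small sketch of the same point with a user-defined function. The names noisy and calls are made up for illustration; the list simply records when the function body actually runs.

```python
calls = []

def noisy():
    # Record every time the body is actually executed
    calls.append("called")
    return 42

reference = noisy                 # no parentheses: nothing runs yet
calls_before = list(calls)

value = reference()               # parentheses: the body runs now
calls_after = list(calls)

print(calls_before, calls_after, value)
```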
    

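    pandas checks whether each value passed to assign is callable. If it is, pandas calls it with the current state of the frame being built, so any columns produced by earlier assign arguments are included. That is why the lambda can see D, while df['D'] (evaluated against the original df before assign runs) cannot.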
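
To make that last step concrete, here is a rough model of what assign does with its keyword arguments. assign_sketch is a hypothetical helper written for illustration, not pandas' actual source, and the two-row sample data and the column name E are assumptions as well.

```python
import pandas as pd

def assign_sketch(frame, **kwargs):
    """Simplified, hypothetical model of DataFrame.assign."""
    data = frame.copy()                # the original frame is left untouched
    for name, value in kwargs.items():
        if callable(value):
            value = value(data)        # called with the frame built so far
        data[name] = value
    return data

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

kwargs = dict(
    A=df["A"] * df["B"],               # evaluated now, against the original df
    D=df["B"] * df["C"],               # evaluated now, against the original df
    E=lambda x: x["D"] * x["C"],       # evaluated later, inside assign
)

result = df.assign(**kwargs)
sketch = assign_sketch(df, **kwargs)
print(result)
```

Under these assumptions both calls produce the same frame: the lambda receives a frame that already contains D, while df itself never gains that column.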