I often use pandas.DataFrame.assign() in order to method chain in Python.
When calculating values from existing columns, I never have to use lambda. But if I want to create a calculated column from a column I created within the same assign statement, I have to use lambda x. The code below works, but I do not understand why lambda makes it work.
Let's say I have an existing DataFrame with columns A, B, C. Using an assign statement, I want to change A by multiplying A and B. I also create a new column D by multiplying B and C. Then I want to multiply C and D. This only works using lambda. Why does lambda remember that I created column D but the normal df['D'] * df['C'] does not?
A | B | C |
---|---|---|
One | Two | Three |
df = (df
      .assign(A = df['A'] * df['B'],
              D = df['B'] * df['C'],
              E = lambda x: x['D'] * x['C']))
Assigning multiple columns within the same assign is possible. Later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order.
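As a minimal sketch of what that documentation note describes, the chain below works once the column that depends on D is passed as a callable. The numeric values and the column name E are assumptions made for illustration; the question's frame only shows placeholders.

```python
import pandas as pd

# Illustrative data; these values are an assumption, not from the question.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

result = df.assign(
    A=df["A"] * df["B"],          # computed from the original df before assign runs
    D=df["B"] * df["C"],          # same: only uses columns that already exist
    E=lambda x: x["D"] * x["C"],  # called by assign with a frame that already has D
)

print(result)
```

Note that df itself is untouched: assign returns a new frame, which is why df['D'] never exists on the original.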
Firstly it has to do with the order of execution.
With .assign(A = df['A'] * df['B']), the df['A'] is evaluated before df.assign executes.
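from datetime import datetime
import pandas as pd

def new_value():
    # Stand-in for any expression passed as an argument (return value is arbitrary)
    print("Hello from: new_value()")
    print(datetime.now())
    return 0

df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
assign = df.assign  # keep a reference to the real method

def debug_assign(**kwargs):
    print("Hello from: assign()")
    print(datetime.now())
    assign(**kwargs)  # result discarded; only the print order matters here

df.assign = debug_assign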
>>> df.assign(D = new_value())
Hello from: new_value()
2023-02-14 16:08:38.424683
Hello from: assign()
2023-02-14 16:08:38.424722
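The same ordering explains the failure in the question. In this sketch (the column name E is an assumption for illustration), df['D'] is evaluated while Python is still building the arguments, so the lookup fails before assign is ever called:

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})

raised = False
try:
    df.assign(
        D=df["B"] * df["C"],
        E=df["D"] * df["C"],  # evaluated immediately; df has no column "D" yet
    )
except KeyError:
    # The error comes from df["D"], not from inside assign
    raised = True

print("KeyError raised:", raised)
```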
As for a lambda: it is like a "mini-function". When you declare a lambda, it's like defining a function; nothing is actually executed.
>>> lambda x: x['D'] * x['C']
<function __main__.<lambda>(x)>
Meaning:
>>> df.assign(D = lambda x: x['D'] * x['C'])
Is similar to doing:
>>> def callback(x): return x['D'] * x['C']
>>> df.assign(D = callback)
Functions can be assigned to variables and passed as arguments.
>>> my_other_print = print
>>> my_other_print
<function print>
They're not executed/called until () is used (notice there is no () in D = callback).
>>> my_other_print("hello")
hello
pandas checks whether each argument is a "callable". If it is, pandas calls it with the current "state" of the DataFrame, that is, a frame that already includes any previous assign arguments that have been computed.
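A rough pure-Python model of that behaviour might look like the following. The helper name simple_assign is hypothetical and this is not pandas' actual source, only a simplified sketch: copy the frame, then process the keyword arguments in order, calling any callable with the frame as built so far.

```python
import pandas as pd

def simple_assign(frame, **kwargs):
    """Simplified model of DataFrame.assign (hypothetical helper, not pandas code)."""
    data = frame.copy()                 # the original frame is left untouched
    for name, value in kwargs.items():  # kwargs keep the order they were passed in
        if callable(value):
            value = value(data)         # data already holds earlier new columns
        data[name] = value
    return data

df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
out = simple_assign(df, D=df["B"] * df["C"], E=lambda x: x["D"] * x["C"])
print(out)
```

So the lambda does not "remember" anything. It is simply called late, with a frame in which D has already been assigned.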