python pandas dataframe lambda assign

Python dataframe assign new column using lambda function with 2 variables and if else statement


Set up the dataframe:

import pandas as pd
import numpy as np

np.random.seed(99)

rows = 10

df = pd.DataFrame({'A': np.random.choice(range(0, 2), rows, replace=True),
                   'B': np.random.choice(range(0, 2), rows, replace=True)})

df


   A  B
0  1  1
1  1  1
2  1  0
3  0  1
4  1  1
5  0  1
6  0  1
7  0  0
8  1  1
9  0  1

I would like to add a column 'C' with the value 'X' if df.A and df.B are both 0, and the value 'Y' otherwise.

I tried:

df.assign(C = lambda row: 'X' if row.A + row.B == 0 else 'Y')

but that does not work...

I found other ways to get my results but would like to use .assign with a lambda function in this situation.

Any suggestions on how to get assign with lambda working?


Solution

  • No, don't use lambda

    You can do this vectorised:

    import numpy as np
    
    df['C'] = np.where(df['A'] + df['B'] == 0, 'X', 'Y')
    

    The lambda solution has no benefit here, but if you want it...

    df = df.assign(C=np.where(df.pipe(lambda x: x['A'] + x['B'] == 0), 'X', 'Y'))
    

    The bad way to use assign + lambda:

    df = df.assign(C=df.apply(lambda x: 'X' if x.A + x.B == 0 else 'Y', axis=1))
    

    What's wrong with the bad way is that apply with axis=1 iterates over the rows in a Python-level loop, which is often slower than a regular Python for loop.

    The first two solutions perform vectorised operations on contiguous memory blocks, and are processed more efficiently as a result.
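
If you do want assign with a lambda, as the question asks, it may help to see why the original attempt fails. A callable passed to assign is called with the whole DataFrame, not with one row at a time, so the condition is a boolean Series and a plain if/else cannot evaluate it. Below is a minimal sketch of that failure and of a vectorised lambda that avoids it. The small frame and the names d, failed and result are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same column names as the question.
df = pd.DataFrame({'A': [0, 1, 0, 1], 'B': [0, 0, 1, 1]})

# The lambda receives the whole DataFrame, so d.A + d.B == 0 is a
# boolean Series; using it in a plain if/else raises ValueError.
try:
    df.assign(C=lambda d: 'X' if d.A + d.B == 0 else 'Y')
    failed = False
except ValueError:
    failed = True

# Vectorised lambda: np.where chooses 'X' or 'Y' element by element.
result = df.assign(C=lambda d: np.where(d['A'] + d['B'] == 0, 'X', 'Y'))
print(result)
```

Note that assign returns a new DataFrame and leaves df unchanged, so the result has to be bound to a name (or reassigned to df) to keep the new column.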