Update a column depending on whether another column's value is in a list

I'm working with a data set that needs some manual cleanup. One thing I need to do is assign a certain value in one column to some of my rows, if that row's value in another column is present in a list I've defined.

Here is a reduced example of what I want to do:

import numpy as np
import pandas as pd

to_be_changed = ['b', 'e', 'a']

df = pd.DataFrame({'col1': [1, 2, 2, 1, 2], 'col2': ['a', 'b', 'c', 'd', 'e']})

# set col1 to 3 in every row whose col2 label appears in to_be_changed

So the desired modified DataFrame would look like:

  col1 col2
0    3    a
1    3    b
2    2    c
3    1    d
4    3    e

My closest attempt to solving this is:

df = pd.DataFrame(np.where(df=='b' ,3,df)
  ,index=df.index,columns=df.columns)

Which produces:

 col1 col2
0    1    a
1    2    3
2    2    c
3    1    d
4    2    e

This changes col2 instead of col1, and obviously only in the rows with the hard-coded label 'b'.

I also tried:

df = pd.DataFrame(np.where(df in to_be_changed ,3,df)
  ,index=df.index,columns=df.columns)

But that produces an error:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_11084/574679588.py in <cell line: 4>()
      3 df = pd.DataFrame({'col1':[1,2,2,1,2],'col2':['a','b','c','d','e' ]})
      4 df = pd.DataFrame(
----> 5   np.where(df in to_be_changed ,3,df)
      6   ,index=df.index,columns=df.columns)
      7 df

~/.local/lib/python3.9/site-packages/pandas/core/generic.py in __nonzero__(self)
   1525     @final
   1526     def __nonzero__(self):
-> 1527         raise ValueError(
   1528             f"The truth value of a {type(self).__name__} is ambiguous. "
   1529             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Thanks for any help!

Solution

You can use pandas loc together with isin(): isin() builds a boolean mask over col2, and loc assigns to col1 only in the rows where that mask is True:

import pandas as pd

to_be_changed = ['b', 'e', 'a']
df = pd.DataFrame(
    {
        'col1':[1, 2, 2, 1, 2],
        'col2':['a', 'b', 'c', 'd', 'e' ]
    }
)

df.loc[df['col2'].isin(to_be_changed), 'col1'] = 3

produces the expected output:

   col1 col2
0     3    a
1     3    b
2     2    c
3     1    d
4     3    e
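To see why isin() works where `df in to_be_changed` raised an error, it may help to look at the mask on its own. The sketch below reuses the question's data; the variable name `mask` is only for illustration. isin() tests each element separately and returns a boolean Series, whereas Python's `in` operator needs a single True/False answer from each comparison, which a DataFrame cannot provide.

```python
import pandas as pd

to_be_changed = ['b', 'e', 'a']
df = pd.DataFrame(
    {
        'col1': [1, 2, 2, 1, 2],
        'col2': ['a', 'b', 'c', 'd', 'e'],
    }
)

# Element-wise membership test: one boolean per row of col2.
mask = df['col2'].isin(to_be_changed)
print(mask.tolist())

# loc uses the mask to select rows and the label to select the column.
df.loc[mask, 'col1'] = 3
print(df)
```

Rows where the mask is False are left untouched, which is why 'c' and 'd' keep their original col1 values.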

I find loc useful because you can change several columns at once given the same condition:

import pandas as pd

to_be_changed = ['b', 'e', 'a']
df = pd.DataFrame(
   {
       'col1':[1, 2, 2, 1, 2],
       'col2':['a', 'b', 'c', 'd', 'e'],
       'col3':[5, 6, 7, 8, 9]
   }
)

df.loc[df['col2'].isin(to_be_changed), ['col1', 'col3']] = [3, 0]

which gives you:

   col1 col2  col3
0     3    a     0
1     3    b     0
2     2    c     7
3     1    d     8
4     3    e     0

However, for large DataFrames np.where might be faster, though I haven't benchmarked it.
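For comparison, an np.where version of the single-column update could look like the following. This is a minimal sketch, not a benchmarked recommendation; it applies np.where to col1 only, rather than to the whole DataFrame as in the question's attempts.

```python
import numpy as np
import pandas as pd

to_be_changed = ['b', 'e', 'a']
df = pd.DataFrame(
    {
        'col1': [1, 2, 2, 1, 2],
        'col2': ['a', 'b', 'c', 'd', 'e'],
    }
)

# True where col2 holds one of the listed labels.
mask = df['col2'].isin(to_be_changed)

# Take 3 where the mask is True, otherwise keep the existing col1 value.
df['col1'] = np.where(mask, 3, df['col1'])
print(df)
```

Unlike the loc assignment, this rebuilds the whole col1 column, so it is worth timing both on your own data before choosing one for speed.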