Tags: python, pandas, dataframe

Update a column depending on whether another column's value is in a list


I'm working with a data set that needs some manual cleanup. One thing I need to do is assign a certain value in one column for some of my rows, whenever the value in another column of that row is present in a list I've defined.

Here is a reduced example of what I want to do:

import numpy as np
import pandas as pd

to_be_changed = ['b', 'e', 'a']

df = pd.DataFrame({'col1': [1, 2, 2, 1, 2], 'col2': ['a', 'b', 'c', 'd', 'e']})

# Goal: set col1 to 3 in every row whose col2 label appears in to_be_changed

So the desired modified DataFrame would look like:

  col1 col2
0    3    a
1    3    b
2    2    c
3    1    d
4    3    e

My closest attempt at solving this is:

df = pd.DataFrame(np.where(df=='b' ,3,df)
  ,index=df.index,columns=df.columns)

Which produces:

  col1 col2
0    1    a
1    2    3
2    2    c
3    1    d
4    2    e

This changes col2 instead of col1, and only in the rows matching the hardcoded label 'b'.

I also tried:

df = pd.DataFrame(np.where(df in to_be_changed ,3,df)
  ,index=df.index,columns=df.columns)

But that produces an error:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_11084/574679588.py in <cell line: 4>()
      3 df = pd.DataFrame({'col1':[1,2,2,1,2],'col2':['a','b','c','d','e' ]})
      4 df = pd.DataFrame(
----> 5   np.where(df in to_be_changed ,3,df)
      6   ,index=df.index,columns=df.columns)
      7 df

~/.local/lib/python3.9/site-packages/pandas/core/generic.py in __nonzero__(self)
   1525     @final
   1526     def __nonzero__(self):
-> 1527         raise ValueError(
   1528             f"The truth value of a {type(self).__name__} is ambiguous. "
   1529             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Thanks for any help!


Solution

  • You could use pandas loc together with the isin() function:

    import pandas as pd
    
    to_be_changed = ['b', 'e', 'a']
    df = pd.DataFrame(
        {
            'col1':[1, 2, 2, 1, 2],
            'col2':['a', 'b', 'c', 'd', 'e' ]
        }
    )
    
    df.loc[df['col2'].isin(to_be_changed), 'col1'] = 3
    

    This produces the expected output:

       col1 col2
    0     3    a
    1     3    b
    2     2    c
    3     1    d
    4     3    e
    

    I find it useful because you can change several columns at once under the same condition:

    import pandas as pd
    
    to_be_changed = ['b', 'e', 'a']
    df = pd.DataFrame(
        {
            'col1': [1, 2, 2, 1, 2],
            'col2': ['a', 'b', 'c', 'd', 'e'],
            'col3': [5, 6, 7, 8, 9]
        }
    )
    
    df.loc[df['col2'].isin(to_be_changed), ['col1', 'col3']] = [3, 0]
    

    which gives you:

       col1 col2  col3
    0     3    a     0
    1     3    b     0
    2     2    c     7
    3     1    d     8
    4     3    e     0
    

    However, for large DataFrames np.where may be faster; I have not benchmarked it.
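
For comparison, here is a minimal sketch of how the np.where attempt from the question could be adapted, assuming the goal is still to set col1 to 3 wherever col2 is in to_be_changed. The variable name mask is only illustrative. The difference from the original attempt is that membership is tested element-wise with isin() on col2 (Python's in operator needs a single True/False, which is what raises the ValueError), and that the result is assigned back to col1 alone rather than rebuilding the whole DataFrame.

```python
import numpy as np
import pandas as pd

to_be_changed = ['b', 'e', 'a']
df = pd.DataFrame({'col1': [1, 2, 2, 1, 2], 'col2': ['a', 'b', 'c', 'd', 'e']})

# isin() checks membership row by row and returns a boolean Series,
# one True/False per row of col2.
mask = df['col2'].isin(to_be_changed)

# np.where takes 3 where the mask is True and the existing col1 value elsewhere.
# Assigning to a single column leaves col2 untouched.
df['col1'] = np.where(mask, 3, df['col1'])

print(df)
```

This should give the same frame as the loc version above. Whether it is actually faster would need to be measured on your own data, for example with timeit.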