Search code examples
pythonpandasnumpywhere-clausearray-broadcasting

How can I fill two columns of a dataframe with "np.where"?


I am trying to set 2 columns by a condition on a 3rd column. I can set 1 column conditions on another column, and I can set 2 columns on a single condition value, but when I try to set 2 columns by a condition on a column, it fails.

Here is the code example:

import pandas as pd
import numpy as np
AAA = {"column A": [1, 1, 1, 2, 2, 2, 3, 3, 3]}
df = pd.DataFrame(AAA)

If I call:

df["column B"], df["column C"] = np.where(True ,['4', '8'],['NaN', 'NaN'])

I get:

df
   column A column B column C
0         1        4        8
1         1        4        8
2         1        4        8
3         2        4        8
4         2        4        8
5         2        4        8
6         3        4        8
7         3        4        8
8         3        4        8

so I know I can fill 2 columns based on a condition.

If I call:

df["column B"] = np.where( df["column A"] == 2 ,['4'],['NaN'])

I get:

   column A column B column C
0         1      NaN        8
1         1      NaN        8
2         1      NaN        8
3         2        4        8
4         2        4        8
5         2        4        8
6         3      NaN        8
7         3      NaN        8
8         3      NaN        8

so I know I can fill based on a condition on a column. (I assume this is treated as a boolean array)

However, If I call:

df["column B"], df["column C"] = np.where( df["column A"] == 2 ,['4', '8'],['NaN', 'NaN'])

I expect to get

   column A column B column C
0         1      NaN        NaN        
1         1      NaN        NaN        
2         1      NaN        NaN        
3         2        4        8
4         2        4        8
5         2        4        8
6         3      NaN        NaN        
7         3      NaN        NaN        
8         3      NaN        NaN        

but I actually get:

Traceback (most recent call last):
 ... pydev\_pydevd_bundle\pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<string>", line 2, in <module>
  File "<__array_function__ internals>", line 6, in where
ValueError: operands could not be broadcast together with shapes (9,) (2,) (2,) 

Is there a way to do what I am trying to do? I don't want to use 2 separate calls, because the dataframes I need this for are very large.


Solution

  • You are almost there! It is just a matter of "broadcasting".

    You could use any of the problems suggested by the others. Or use your same concept but reshape the inputs a little.

    Like this:

    # Reshape the condition, then transpose the output.
    df["column B"], df["column C"] = np.where( np.array(df["column A"] == 2).reshape(-1,1) ,['4', '8'],['NaN', 'NaN']).T
    

    Or like this:

    # Or just reshape the lists
    df["column B"], df["column C"] = np.where( df["column A"] == 2 ,np.array(['4', '8']).reshape(-1,1),np.array(['NaN', 'NaN']).reshape(-1,1))
    

    Output:

        column A    column B    column C
    0   1   NaN NaN
    1   1   NaN NaN
    2   1   NaN NaN
    3   2   4   8
    4   2   4   8
    5   2   4   8
    6   3   NaN NaN
    7   3   NaN NaN
    8   3   NaN NaN
    

    You can check numpy documentation on broadcasting to get the idea: https://numpy.org/doc/stable/user/basics.broadcasting.html