The basics of using Pandas where with callables seems simple.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df["test"] = range(1,9)
def MyBool(x):
print(1)
return ( x > 0 )
def MyFunc(x1):
print(1)
return x1['A']
df.where(
cond = lambda x: MyBool(x),
other = lambda x: MyFunc(x) ,
)
In the code above, I am replacing the values of all columns with the value of column A whenever the value of the col is less than 0. Note, I know I don't need to use callables for this simple example.
Based on my analysis, this is what is happening under the hood.
First, MyFunc is evaluated where the argument is the df itself. This returns a 8x1 df (=A)
Second, the MyBool is evaluated which returns a 8x5 boolean df.
Third, (not sure about this last step) for all entries (i,j) where MyBool returned False, the value of the i'th row of the output of MyFunc is used to replace the current value of the df.
This leads me on to my question: how does this extend to the cases when MyFunc returns a dataframe with several columns and rows? How does the function determine which entries need to be replaced and with which values?
For illustrative purposes, suppose now that we want to divide B and C by 2 when test is equal to 5. The code I have provided below works but I don't quite understand how it determines which entries are to be replaced and with which values.
MyBool still returns a one dimensional vector but MyFunc returns a dataframe. If the previous logic I explained was correct, then shouldn't it replace each False entry with the dataframe? Indeed, if this were the case, the resulting dataframe should be bigger than the input df.
I've been reading the documentation and playing with different examples but can't figure this one out.
def MyBool(x):
output = x.test != 5
return output
def MyFunc(x1):
x1.loc[ x1.test == 5, ["B", "C"] ] /= 2
return x1
df.where(
cond = lambda x: MyBool(x),
other = lambda x: MyFunc(x.copy()),
axis = 0
)
The logic is quite simple, for each False
in the output of the cond
callable, the matching value in the result of other
will be used as replacement. If other
is a scalar, this value is used.
The matching value is identified by position if the callable returns an array, and by alignment for a DataFrame:
df = pd.DataFrame({'A': [1, 0], 'B': [0, 1]})
df.where(cond=lambda x: x==0,
other=lambda x: pd.DataFrame({'A': [10, 20]}, index=[1, 0]))
# A B
# 0 20 0.0
# 1 0 NaN
df.where(cond=lambda x: x==0, other=lambda x: [[7,8],[9,10]])
# A B
# 0 7 0
# 1 0 10
Therefore, your MyFunc
functions should return a scalar, a DataFrame (that will be aligned), or an array of the same shape as the input.
You can modify it to broadcast the values to all columns:
def MyFunc(x1):
print(1)
return np.broadcast_to(x1['A'].to_frame().values, df.shape)
Example:
df = pd.DataFrame([[1, -1, 0, 0],
[2, 0, -1, 0],
[3, 0, 0, -1]],
columns=['A', 'B', 'C', 'D'])
def MyBool(x):
return x >= 0
def MyFunc(x):
return np.broadcast_to(x['A'].to_frame().values, df.shape)
out = df.where(cond=MyBool, other=MyFunc)
# A B C D
# 0 1 1 0 0
# 1 2 0 2 0
# 2 3 0 0 3
Note that the callables should NOT modify the DataFrame in place.
This should be avoided:
def MyFunc(x1):
x1.loc[ x1.test == 5, ["B", "C"] ] /= 2
return x1
and could be replaced by a simple (without using where
):
df.loc[df['test'] == 5, ['B', 'C']] /= 2