The follows on from: Pandas - creating 2 new columns based on 2 columns and a separate test column
But it's a different question in it's own right. It should be simpler!
In the referenced question the following one-liner is discussed for data-filling 2 new columns from 2 other columns, and dependent on the value of a third column:
df['Buyer ID'], df['Seller ID'] = zip(
*np.where(df.buy_sell == 'Buy',
(df.buyer_name,df.seller_name),
(df.seller_name,df.buyer_name)).T)
This works well - but when I try to simplify this to use fixed scalar values rather than corresponding values in other columns, it doesn't work.
For example, if I only have one possible buyer, John, and one possible Seller, Maggie, then the follow simpler construct should suffice:
df['Buyer ID'], df['Seller ID'] = zip(
*np.where(df.buy_sell == 'Buy',
("John","Maggie"),
("Maggie","John")).T)
This is failing on the inner np.where() call with:
operands could not be broadcast together with shapes
I've tried a few variations like wrapping the tuples in zip(), which changes the shape but I still get the error. I think the problem is that ("John","Maggie") is not returned as the contents of a single column. The tuple is expanded to mean >1 column?
This link also showed some promise: Changing certain values in multiple columns of a pandas DataFrame at once
But I think the solution assumes the columns you wish to edit already exist and that you only want the same single value placed in every column.
I can get around the problem by making several passes, but it's not ideal:
np.where(df.buy_sell == 'Buy', 'John', 'Maggie')
Ideally for each row, I want a single-pass solution extendible to N new columns being filled with different, fixed, default values, but all depending on a single (boolean) value in another column.
Any pointers on what I'm missing?
The answer from jezrael gives a perfectly good way of doing this. But to explain why only the first example works in the original question, I found the links below a useful reference:
https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html and https://eli.thegreenplace.net/2015/broadcasting-arrays-in-numpy/
I've applied the references to the case in hand.
To recap:
First case - this works:
np.where(df.buy_sell == 'Buy',(df.buyer_name,df.seller_name),(df.seller_name,df.buyer_name))
Second case - this doesn't work:
np.where(df.buy_sell == 'Buy',("John","Maggie"), ("Maggie","John"))
Third case - this does work:
np.where(df.buy_sell == 'Buy', 'John', 'Maggie')
What (I think!) is happening in the first case is an attempt to broadcast across:
(n,) (n,) (n,) - this is fine because all non-zero dimensions equal
In the second case it is
(n,) (2,) (2,) - this is not OK because non-equal dimensions eg n<>2 - the nature of tuples is that that they are (2,) and clash with the (n,) of the buy_sell.
In the final case it is
(n,) (1,) (1,) - this is the same as the above, however this works because you can stretch a 1 over n, so it doesn't clash.
So to construct something that works for the scalar case we need to alter the tuples:
(n,) (2,) (2,)
To avoid a mismatch we changed it to:
(n,) (2,1) (2,1)
Now this isn't obvious, but what numpy will automagically do to broadcast this is left-pad the (n,) to (1,n), giving us:
(1,n) (2,1) (2,1)
This way there are no mismatched dimensions >1, giving a broadcast object of (2,n) - 2 rows of n columns each. You can see this by manually applying the np.broadcast()
to the 3 arrays and calling shape
on the result.
It's important to understand the difference between (x,) and (x,1) to get how this works. Basically - (x,) only has 1 dimension, (x,1) has 2 dimensions where the 2nd dimension is restricted to a single value. See here for details: Difference between numpy.array shape (R, 1) and (R,)
So the desired result can be achieved using the construct below:
np.where(df.buy_sell == 'Buy', (["John"],["Maggie"]), (["Maggie"],["John"]))
The result is then transposed to get n rows of 2 columns, so each row can be passed as parameters into zip()
to allow multiple assignments.
I'm pretty sure jezrael's solution effectively does the same thing, but in this case the buy_sell is given the extra dimension rather than the text outputs - but the same objective is achieved - keeping mismatched >1 dimensions on different axis.
In this case the buy_sell becomes (n,1) so we have
(n,1) (2,) (2,)
Which is left padded to
(n,1) (1,2) (1,2)
Giving a broadcast object of (n,2).
The nice thing about this solution is that no transpose is needed before applying zip()
.