python pandas dataframe array-broadcasting

Pandas - Creating multiple default columns from existing column

The follows on from: Pandas - creating 2 new columns based on 2 columns and a separate test column

But it's a different question in it's own right. It should be simpler!

In the referenced question the following one-liner is discussed for data-filling 2 new columns from 2 other columns, and dependent on the value of a third column:

df['Buyer ID'], df['Seller ID'] = zip(
    *np.where(df.buy_sell == 'Buy',
             (df.buyer_name,df.seller_name), 
             (df.seller_name,df.buyer_name)).T)

This works well - but when I try to simplify this to use fixed scalar values rather than corresponding values in other columns, it doesn't work.

For example, if I only have one possible buyer, John, and one possible Seller, Maggie, then the follow simpler construct should suffice:

df['Buyer ID'], df['Seller ID'] = zip(
    *np.where(df.buy_sell == 'Buy',
             ("John","Maggie"), 
             ("Maggie","John")).T)

This is failing on the inner np.where() call with:

operands could not be broadcast together with shapes

I've tried a few variations like wrapping the tuples in zip(), which changes the shape but I still get the error. I think the problem is that ("John","Maggie") is not returned as the contents of a single column. The tuple is expanded to mean >1 column?

This link also showed some promise: Changing certain values in multiple columns of a pandas DataFrame at once

But I think the solution assumes the columns you wish to edit already exist and that you only want the same single value placed in every column.

I can get around the problem by making several passes, but it's not ideal:

np.where(df.buy_sell == 'Buy', 'John', 'Maggie')

Ideally for each row, I want a single-pass solution extendible to N new columns being filled with different, fixed, default values, but all depending on a single (boolean) value in another column.

Any pointers on what I'm missing?

Solution

The answer from jezrael gives a perfectly good way of doing this. But to explain why only the first example works in the original question, I found the links below a useful reference:

https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html and https://eli.thegreenplace.net/2015/broadcasting-arrays-in-numpy/

I've applied the references to the case in hand.

To recap:

First case - this works:

np.where(df.buy_sell == 'Buy',(df.buyer_name,df.seller_name),(df.seller_name,df.buyer_name))

Second case - this doesn't work:

np.where(df.buy_sell == 'Buy',("John","Maggie"), ("Maggie","John"))

Third case - this does work:

np.where(df.buy_sell == 'Buy', 'John', 'Maggie')

What (I think!) is happening in the first case is an attempt to broadcast across:

(n,) (n,) (n,) - this is fine because all non-zero dimensions equal

In the second case it is

(n,) (2,) (2,) - this is not OK because non-equal dimensions eg n<>2 - the nature of tuples is that that they are (2,) and clash with the (n,) of the buy_sell.

In the final case it is

(n,) (1,) (1,) - this is the same as the above, however this works because you can stretch a 1 over n, so it doesn't clash.

So to construct something that works for the scalar case we need to alter the tuples:

(n,) (2,) (2,)

To avoid a mismatch we changed it to:

(n,) (2,1) (2,1)

Now this isn't obvious, but what numpy will automagically do to broadcast this is left-pad the (n,) to (1,n), giving us:

(1,n) (2,1) (2,1)

This way there are no mismatched dimensions >1, giving a broadcast object of (2,n) - 2 rows of n columns each. You can see this by manually applying the np.broadcast() to the 3 arrays and calling shape on the result.

It's important to understand the difference between (x,) and (x,1) to get how this works. Basically - (x,) only has 1 dimension, (x,1) has 2 dimensions where the 2nd dimension is restricted to a single value. See here for details: Difference between numpy.array shape (R, 1) and (R,)

So the desired result can be achieved using the construct below:

np.where(df.buy_sell == 'Buy', (["John"],["Maggie"]), (["Maggie"],["John"]))

The result is then transposed to get n rows of 2 columns, so each row can be passed as parameters into zip() to allow multiple assignments.

I'm pretty sure jezrael's solution effectively does the same thing, but in this case the buy_sell is given the extra dimension rather than the text outputs - but the same objective is achieved - keeping mismatched >1 dimensions on different axis.

In this case the buy_sell becomes (n,1) so we have

(n,1) (2,) (2,)

Which is left padded to

(n,1) (1,2) (1,2)

Giving a broadcast object of (n,2).

The nice thing about this solution is that no transpose is needed before applying zip().