Search code examples
pythonpandasdataframearray-broadcasting

Pandas - Creating multiple default columns from existing column


The follows on from: Pandas - creating 2 new columns based on 2 columns and a separate test column

But it's a different question in it's own right. It should be simpler!

In the referenced question the following one-liner is discussed for data-filling 2 new columns from 2 other columns, and dependent on the value of a third column:

df['Buyer ID'], df['Seller ID'] = zip(
    *np.where(df.buy_sell == 'Buy',
             (df.buyer_name,df.seller_name), 
             (df.seller_name,df.buyer_name)).T)

This works well - but when I try to simplify this to use fixed scalar values rather than corresponding values in other columns, it doesn't work.

For example, if I only have one possible buyer, John, and one possible Seller, Maggie, then the follow simpler construct should suffice:

df['Buyer ID'], df['Seller ID'] = zip(
    *np.where(df.buy_sell == 'Buy',
             ("John","Maggie"), 
             ("Maggie","John")).T)

This is failing on the inner np.where() call with:

operands could not be broadcast together with shapes

I've tried a few variations like wrapping the tuples in zip(), which changes the shape but I still get the error. I think the problem is that ("John","Maggie") is not returned as the contents of a single column. The tuple is expanded to mean >1 column?

This link also showed some promise: Changing certain values in multiple columns of a pandas DataFrame at once

But I think the solution assumes the columns you wish to edit already exist and that you only want the same single value placed in every column.

I can get around the problem by making several passes, but it's not ideal:

np.where(df.buy_sell == 'Buy', 'John', 'Maggie') 

Ideally for each row, I want a single-pass solution extendible to N new columns being filled with different, fixed, default values, but all depending on a single (boolean) value in another column.

Any pointers on what I'm missing?


Solution

  • The answer from jezrael gives a perfectly good way of doing this. But to explain why only the first example works in the original question, I found the links below a useful reference:

    https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html and https://eli.thegreenplace.net/2015/broadcasting-arrays-in-numpy/

    I've applied the references to the case in hand.

    To recap:

    First case - this works:

    np.where(df.buy_sell == 'Buy',(df.buyer_name,df.seller_name),(df.seller_name,df.buyer_name))
    

    Second case - this doesn't work:

    np.where(df.buy_sell == 'Buy',("John","Maggie"), ("Maggie","John"))
    

    Third case - this does work:

    np.where(df.buy_sell == 'Buy', 'John', 'Maggie') 
    

    What (I think!) is happening in the first case is an attempt to broadcast across:

    (n,) (n,) (n,) - this is fine because all non-zero dimensions equal

    In the second case it is

    (n,) (2,) (2,) - this is not OK because non-equal dimensions eg n<>2 - the nature of tuples is that that they are (2,) and clash with the (n,) of the buy_sell.

    In the final case it is

    (n,) (1,) (1,) - this is the same as the above, however this works because you can stretch a 1 over n, so it doesn't clash.

    So to construct something that works for the scalar case we need to alter the tuples:

    (n,) (2,) (2,)

    To avoid a mismatch we changed it to:

    (n,) (2,1) (2,1)

    Now this isn't obvious, but what numpy will automagically do to broadcast this is left-pad the (n,) to (1,n), giving us:

    (1,n) (2,1) (2,1)

    This way there are no mismatched dimensions >1, giving a broadcast object of (2,n) - 2 rows of n columns each. You can see this by manually applying the np.broadcast() to the 3 arrays and calling shape on the result.

    It's important to understand the difference between (x,) and (x,1) to get how this works. Basically - (x,) only has 1 dimension, (x,1) has 2 dimensions where the 2nd dimension is restricted to a single value. See here for details: Difference between numpy.array shape (R, 1) and (R,)

    So the desired result can be achieved using the construct below:

    np.where(df.buy_sell == 'Buy', (["John"],["Maggie"]), (["Maggie"],["John"]))
    

    The result is then transposed to get n rows of 2 columns, so each row can be passed as parameters into zip() to allow multiple assignments.

    I'm pretty sure jezrael's solution effectively does the same thing, but in this case the buy_sell is given the extra dimension rather than the text outputs - but the same objective is achieved - keeping mismatched >1 dimensions on different axis.

    In this case the buy_sell becomes (n,1) so we have

    (n,1) (2,) (2,)

    Which is left padded to

    (n,1) (1,2) (1,2)

    Giving a broadcast object of (n,2).

    The nice thing about this solution is that no transpose is needed before applying zip().