Search code examples
pythonpandasdataframesum

Pandas dataframe IF in a SUM function


i have two data frame

     asks_price  asks_qty exchange_name_ask     bids_price  bids_qty exchange_name_bid
0      20156.51  0.000745          Coinbase      20153.28  0.000200          Coinbase
1      20157.52  0.050000          Coinbase      20152.27  0.051000          Coinbase
2      20158.52  0.000745          Coinbase      20151.28  0.000200          Kraken
3      20158.52  0.050000          FTX           20151.28  0.051000          Coinbase

I would like to group the same price, add quantity together and combine the name of the exchange like :

     asks_price  asks_qty exchange_name_ask     bids_price  bids_qty exchange_name_bid
0    20156.51  0.000745          Coinbase      20153.28  0.000200          Coinbase
1    20157.52  0.050000          Coinbase      20152.27  0.051000          Coinbase
2    20158.52  0.050745          CoinbaseFTX   20151.28  0.051200       KrakenCoinbase

I succeeded using

df.groupby(['asks_price', 'bids_price']).sum(False)

It's working BUT if this is the same name i dont want to concat it. I dont want CoinbaseCoinbase or KrakenKraken So i have to include an if in the SUM function , how can i do it Thanks


Solution

  • Try as follows:

    import pandas as pd
    
    # Changing `2: 'Kraken'` to `2: 'Coinbase' for `exchange_name_bid` to generate
    # an example that doesn't require concatenation
    data = {'asks_price': {0: 20156.51, 1: 20157.52, 2: 20158.52, 3: 20158.52}, 
            'asks_qty': {0: 0.000745, 1: 0.05, 2: 0.000745, 3: 0.05}, 
            'exchange_name_ask': {0: 'Coinbase', 1: 'Coinbase', 2: 'Coinbase', 
                                  3: 'FTX'}, 
            'bids_price': {0: 20153.28, 1: 20152.27, 2: 20151.28, 3: 20151.28}, 
            'bids_qty': {0: 0.0002, 1: 0.051, 2: 0.0002, 3: 0.051}, 
            'exchange_name_bid': {0: 'Coinbase', 1: 'Coinbase', 2: 'Coinbase', 
                                  3: 'Coinbase'}}
    df = pd.DataFrame(data)
    
    res = df.groupby(['asks_price', 'bids_price'], as_index=False).agg(
        {'asks_qty':'sum',
         'bids_qty':'sum',
         'exchange_name_ask': lambda x: ','.join(dict.fromkeys(x).keys()),
         'exchange_name_bid': lambda x: ','.join(dict.fromkeys(x).keys())
         })
    
    print(res)
    
       asks_price  bids_price  asks_qty  bids_qty exchange_name_ask  \
    0    20156.51    20153.28  0.000745    0.0002          Coinbase   
    1    20157.52    20152.27  0.050000    0.0510          Coinbase   
    2    20158.52    20151.28  0.050745    0.0512      Coinbase,FTX   
    
      exchange_name_bid  
    0          Coinbase  
    1          Coinbase  
    2          Coinbase  # no concatenation here
    

    Explanation

    • Use df.groupby.agg and split the functions to be applied to the different columns. I.e. sum for columns '*_qty', and a lambda function for columns 'exchange_name_*' with dict.fromkeys nested inside join. (As mentioned by @jezrael in the comments, dict.fromkeys has better performance than using Series.drop_duplicates. Cf. his answer elsewhere.)
    • Change ','.join to ''.join, if you don't want a delimiter. If you want to re-order the columns of res so that they match the order of the original df, use res = res.loc[:,df.columns].