Search code examples
pythonpandasgroup-bybinning

How I can dynamically define bins for grouped data using pandas?


Having a dataset, I have to group it in several ways (with MultiIndex), perform some aggregations and export results. One of these operations is binning (bucketing) on column with prices to obtain amount of elements in each bucket. I need 3 buckets where:

  • in first bucket I obtain all elements for which the price is equal to the lowest price for that grouping or equal or not higher than 110% of the lowest price
  • in second bucket I obtain all elements for which the price is higher than 110% of the lowest price, but or equal or not higher than 150% of the lowest price
  • in third bucket - the rest.

Example:

Product Country Sell col price colb
First DE A b 100 x
Second DE A g 105 z
First FR A b 111 x
Second FR A g 100 z
First DE B b 109 x
Second DE B g 120 z
First FR B b 100 x
Second FR B g 200 z

What I expect:

Product Country Sell 1x 1.1x >1.5x
First DE A 1 0 0
B 0 1 0
FR A 0 1 0
B 1 0 0
Second DE A 1 0 0
B 0 1 0
FR A 1 0 0
B 0 0 1

Now, what I'm trying to do is following:

import numpy as np
import pandas as pd

# some code

df_low_price = df.groupby(["Product","Country","Sell"])["price"].sort_values(by="price").nth(0)
df_low_price_1_1x = df_low_price.map(lambda n: n * 1.1)
df_low_price_1_5x = df_low_price.map(lambda n: n * 1.5)

df_main = pd.concat([
df_low_price,
df_low_price_1_1x,
df_low_price_1_5x, axis=1
])

I'm getting the boundaries, but I don't get the size of the groups. I know that I should rely on pd.cut or cut, but I have no idea how to do it pythonic/pandas way. Thank you in advance for any suggestions.


Solution

    • just followed what you describe
    • to bin cut
    • multi-index and columns groupby/agg and unstack()
    df = pd.DataFrame({'Product': ['First',
      'Second',
      'First',
      'Second',
      'First',
      'Second',
      'First',
      'Second'],
     'Country': ['DE', 'DE', 'FR', 'FR', 'DE', 'DE', 'FR', 'FR'],
     'Sell': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
     'col': ['b', 'g', 'b', 'g', 'b', 'g', 'b', 'g'],
     'price': [100, 105, 111, 100, 109, 120, 100, 200],
     'colb': ['x', 'z', 'x', 'z', 'x', 'z', 'x', 'z'],
     'bin': ['1x', '1x', '1.1x', '1x', '1x', '1.1x', '1x', '>1.5x']})
    
    df["bin"] = pd.cut(
        df["price"],
        bins=[
            df["price"].min() - 1,
            df["price"].min() * 1.1,
            df["price"].min() * 1.5,
            df["price"].max(),
        ],
        labels=["1x", "1.1x", ">1.5x"],
    )
    
    df.groupby(["Product","Country","Sell","bin"]).agg({"col":"count"}).unstack().droplevel(0,1)
    
    

    output

    bin                   1x  1.1x  >1.5x
    Product Country Sell                 
    First   DE      A      1     0      0
                    B      1     0      0
            FR      A      0     1      0
                    B      1     0      0
    Second  DE      A      1     0      0
                    B      0     1      0
            FR      A      1     0      0
                    B      0     0      1