Search code examples
pythonpython-3.xdataframepandas-groupbysklearn-pandas

How to group categorical column followed by numerical column, and based on this group to bin the numerical values


I have a dataset, in which the column "Type" is basically shapes, and corresponding to that, the column "Volume" consists of volume to that shape

Now I need to do the following task:

  1. Group-by shape
  2. And for each shape, group-by volume
  3. And for each shape and volume, define a range and form bins

Input:

 Type             Volume

 Cylinder          100
 Square            300
 Cylinder          200
 Oval              100
 Square            320
 Cylinder          150
 Oval              600
 Round             1000
 Square            900
 Round             1500

Output:

 Type              Volume       Bin

 Cylinder          100            1
 Cylinder          150            1
 Cylinder          200            2
 Oval              100            1
 Oval              600            3
 Round             1000           1
 Round             1500           2
 Square            300            1
 Square            320            1
 Square            900            3

The bin would be as below:

1.Cylinder -> Bin1(100-200), Bin2(201-300) ....

2.Oval -> Bin1(100-200), ..... Bin3(500-600).... ....

Code:

  grouped=df_dim.groupby('Type', as_index=False)
  def test(group):
     return group.reset_index()
  def group_vol(group):
     groupedVol = 
         group.groupby(pd.cut(group["Target_BrimVol"],
         np.arange(0,5000,200)),as_index=False)

     return groupedVol.apply(test)

  gr = grouped.apply(group_vol)
  print(gr)

Solution

  • I think you can try the code below.

    testdf = df.groupby('Type',as_index=False).apply(lambda x: x.groupby(pd.cut(x["Vol"],np.arange(x["Volume"].min(),x["Volume"].max(),200)),as_index=False).apply(test))
    

    What is happening here is, the first groupby basically groups the Dataframe into "Type" category and then you want to group it based on the ranges. For this you can group it again using the lambda function using pd.cut function to make small cuts of interval based on your range. In this case I am just taking the max and min value and cutting it in intervals of 200. After this , if you want to merge the outputs back together forming a Dataframe again, using one more apply to merge them back. like this,

    def test(group):
       #Write your function here. Whatever you want to perform.
       return group.merge(group)
    

    And I am using as_index=False to reset the index here so that the dataframe is rearranged as per the new index.

    Hope this helps.

    Edit:- For bins, you do not have to worry because each groupby creates a new index, which you can use for your purpose. As in,

    Index1  Index2  Type  Volume
    0 0 Cylinder  100
    0 0 Cylinder  140
    0 1 Cylinder  250
    1 0 Oval  154
    1 4 Oval 999
    2 1 Circle  328