Search code examples
pythonpandasgroupingbinningdata-preprocessing

How to find the optimal bin width which will include most datapoints in Python


I'm searching for an easier solution to categorize my data. Based on two columns i'm trying to find the best combination of the two which will include most cases. For both of the two columns i've set an deviation of .25 above and below the value.

To illustrate my problem you could run the following script. The script does exactly what it should do, but it feels unclean and rubbish.

import pandas as pd

df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
                   'col2':[1,1,1,1.2,1.2,1.3,1.5,2,2,2.3,2.5,2.8,3,3,3],
                   'col3':[2,2.2,2.3,4.2,1.6,1.6,1.3,1.4,1.5,1.7,2.4,2.8,2.9,3,4]})

df_combinations = pd.DataFrame()

for i,r in df.iterrows():
    range_min_c2,range_max_c2 = r['col2'] -.25,r['col2'] +.25
    range_min_c3,range_max_c3 = r['col3'] -.25,r['col3'] +.25
    dft = df.copy(deep=True)
    dft['test'] = dft.apply(lambda x: 1 if range_min_c2 < x['col2'] < range_max_c2 and 
                                    range_min_c3 < x['col3'] < range_max_c3 else 0,axis=1)
    sum_test = sum(dft['test'])
    df_combinations = df_combinations.append(pd.DataFrame({'min_c2':[range_min_c2],'max_c2':[range_max_c2],
                                                           'min_c3':[range_min_c3],'max_c3':[range_max_c3],
                                                           'sum_test':[sum_test]}))  

My question is pretty simple: Is there an easier (more pretty) way to get this output? Perhaps a predefined function? Anyway, thanks in advance!


Solution

  • In order to create the dataframe df_combinations with the columns min_ and max_ for col2 and col3, one way to do it is using pandas.DataFrame as follows

    deviation = .25
    df_combinations = pd.DataFrame({'min_c2':df['col2'] -deviation,'max_c2':df['col2'] +deviation,
                                'min_c3':df['col3'] -deviation,'max_c3':df['col3'] +deviation})
    
    [Out]:
    
        min_c2  max_c2  min_c3  max_c3
    0     0.75    1.25    1.75    2.25
    1     0.75    1.25    1.95    2.45
    2     0.75    1.25    2.05    2.55
    3     0.95    1.45    3.95    4.45
    4     0.95    1.45    1.35    1.85
    5     1.05    1.55    1.35    1.85
    6     1.25    1.75    1.05    1.55
    7     1.75    2.25    1.15    1.65
    8     1.75    2.25    1.25    1.75
    9     2.05    2.55    1.45    1.95
    10    2.25    2.75    2.15    2.65
    11    2.55    3.05    2.55    3.05
    12    2.75    3.25    2.65    3.15
    13    2.75    3.25    2.75    3.25
    14    2.75    3.25    3.75    4.25
    

    Then, for the column sum_test, one has a variety of options.

    1. Using pandas.DataFrame.apply with a custom lambda function

      df_combinations['sum_test'] = df_combinations.apply(lambda x: sum((df['col2'] > x['min_c2']) &
                                                                       (df['col2'] < x['max_c2']) &
                                                                       (df['col3'] > x['min_c3']) &
                                                                       (df['col3'] < x['max_c3'])),axis=1)
      
      [Out]:
      
          min_c2  max_c2  min_c3  max_c3  sum_test
      0     0.75    1.25    1.75    2.25         2
      1     0.75    1.25    1.95    2.45         3
      2     0.75    1.25    2.05    2.55         2
      3     0.95    1.45    3.95    4.45         1
      4     0.95    1.45    1.35    1.85         2
      5     1.05    1.55    1.35    1.85         2
      6     1.25    1.75    1.05    1.55         1
      7     1.75    2.25    1.15    1.65         2
      8     1.75    2.25    1.25    1.75         2
      9     2.05    2.55    1.45    1.95         1
      10    2.25    2.75    2.15    2.65         1
      11    2.55    3.05    2.55    3.05         3
      12    2.75    3.25    2.65    3.15         3
      13    2.75    3.25    2.75    3.25         3
      14    2.75    3.25    3.75    4.25         1
      
    2. Using a list comprehension:

      df_combinations['sum_test'] = [sum([1 if x['min_c2'] < y['col2'] < x['max_c2'] and 
                                   x['min_c3'] < y['col3'] < x['max_c3'] else 0 for i,y in df.iterrows()]) for i,x in df_combinations.iterrows()]
      
      [Out]:
      
          min_c2  max_c2  min_c3  max_c3  sum_test
      0     0.75    1.25    1.75    2.25         2
      1     0.75    1.25    1.95    2.45         3
      2     0.75    1.25    2.05    2.55         2
      3     0.95    1.45    3.95    4.45         1
      4     0.95    1.45    1.35    1.85         2
      5     1.05    1.55    1.35    1.85         2
      6     1.25    1.75    1.05    1.55         1
      7     1.75    2.25    1.15    1.65         2
      8     1.75    2.25    1.25    1.75         2
      9     2.05    2.55    1.45    1.95         1
      10    2.25    2.75    2.15    2.65         1
      11    2.55    3.05    2.55    3.05         3
      12    2.75    3.25    2.65    3.15         3
      13    2.75    3.25    2.75    3.25         3
      14    2.75    3.25    3.75    4.25         1
      

    Notes: