Search code examples
pythonpandasbinning

Dynamic bin per row in a dataset in Pandas


I am having trouble dynamically binning my dataset for further calculation. My goal is to have specific bin/labels for each individual row in my dataframe, based on a function, and have the corresponding label assign to the column 'action'.

My dataset is:

id  value1 value2 type  length  amount
1   0.9     1.0     X   10      ['A', 'B']
2   2.0     1.6     Y   80      ['A']
3   0.3     0.5     X   29      ['A', 'C'] 

The function is as follows:

    def bin_label_generator(amount):
        if amount< 2:
            amount= 2
        lower_bound = 1.0 - (1.0/amount) 
        mid_bound = 1.0
        upper_bound = 1.0 + (1.0/amount)
        thresholds = {
            'bins':[-np.inf, lower_bound, mid_bound, upper_bound, np.inf],
            'labels':[0, 1.0, 2.0, 3.0]
        }
        return thresholds

This is my current code, but it requires me to specify a row in order to cut. I would want this to happen automatically with the dictionary specified in the row itself.

# filter on type
filter_type_series = df['type'].str.contains('X')

# get amount of items in amount list
amount_series = df[filter_type_series ]['amount'].str.len()

# generate bins for each row in series
bins_series = amount_series.apply(bin_label_generator)

# get the max values to for binning
max_values = df[filter_type_series].loc[:, [value1, value2]].abs().max(1)

# following line requires a row index, what I do not want
df['action'] = pd.cut(max_values, bins=bins_series[0]['bins'], labels=bins_series[0]['labels'])

Solution

  • Found a fix myself, by just iterating over every single row in the series, and then adding it towards the columns in the actual df.

    type = 'X'
    
    first_df = df.copy()
    type_series = mst_df['type'].str.contains(type)
    
    # loop over every row to dynamically use pd.cut with bins/labels from specific row
    for index, row in mst_df[mst_series].iterrows():
    #     get the max value from rows
        max_val = row[[value1, value2]].abs().max()
        
    #     get amount of cables
        amount = len(row['amount'])
        
    #   get bins and labels for specific row
        bins_label_dict = bin_label_generator(amount)
        bins = bins_label_dict['bins']
        labels = bins_label_dict['labels']
        
    #     append label to row with max value
        first_df .loc[index, 'action'] = pd.cut([max_val], bins=bins, labels=labels)