python dataframe bucket

I have to bucket each column in a DataFrame (8 quantile buckets)


I have a DataFrame with 4 columns. Each column has to be bucketed (its data distributed into 8 buckets), and this should happen iteratively for the first column, the second column, and so on, without specifying the column names manually.

This is the code I am trying:

for col in df3.columns[0:]:
    cb1 = np.linspace(min(col), max(col), 11)
    df3.insert(2 ,'buckets',pd.cut(col, cb1, labels=np.arange(1, 11, 1)))
    print(df3[col])

Here df3 is the sample dataset:

apple  orange  banana
5      2       6
6      4       6
2      8       9
4      7       0

The expected output is:

apple  orange  banana  bucket_apple  bucket_orange  bucket_banana
5      2       6       1             3              2
6      4       6       1             1              4
2      8       9       2             1              8
4      7       0       5             4              1

Here each bucket column gives the bucket number assigned to the corresponding data value.


Solution

  • Since the expected output looks random, with no correlation between your data columns and the bucket numbers, you can generate the buckets separately from the data.

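    import numpy as np

    for c in df.columns:
        # randint(8) draws from 0..7, so adding 1 gives a random bucket in 1..8
        df['bucket_' + c] = np.random.randint(8, size=(len(df))) + 1
    df  # your DataFrame with the random bucket columns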
    

    If you want the buckets to be of equal size:

    for c in df.columns:
        arr = np.arange(8) + 1  # bucket labels 1..8
        arr = np.repeat(arr, len(df) // 8)  # len(df) has to be divisible by 8
        np.random.shuffle(arr)  # shuffle the labels in place
        df['bucket_' + c] = arr
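
If the buckets should instead follow the data, as the question title suggests, one possible approach is quantile bucketing with `pd.qcut`. This is a different technique from the random assignment above, and it will not reproduce the expected output shown in the question. The sketch below rebuilds the sample data from the question and assumes eight buckets. The constant `N_BUCKETS` is a name made up for this example. Ranking with `method="first"` is also an assumption, made so that tied values do not produce duplicate bin edges.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "apple": [5, 6, 2, 4],
    "orange": [2, 4, 8, 7],
    "banana": [6, 6, 9, 0],
})

N_BUCKETS = 8  # assumption: eight quantile buckets, as in the question title

# Loop over a copy of the original column names, so no name is typed by hand.
for col in list(df.columns):
    # Rank first: ties get distinct ranks, so the quantile edges stay unique.
    ranks = df[col].rank(method="first")
    # labels=False returns bucket codes 0..7; add 1 to get buckets 1..8.
    df["bucket_" + col] = pd.qcut(ranks, N_BUCKETS, labels=False) + 1

print(df)
```

With only four rows most of the eight buckets stay empty, but the smallest value in each column lands in bucket 1 and the largest in bucket 8. On a larger DataFrame each bucket would hold roughly `len(df) / 8` rows.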