Search code examples
pythonpandasdataframedata-cleaning

How to create a personalised bucket column from Pandas cut() command with the auto-generated categories and bins?


I have a Pandas DataFrame with multiple columns. One of them is the year. From this column I want to create a new one with categorical values (I guess the therm is buckets), having the buckets auto-generated. It should result in something like that:

year_gr         year    other_cols
A (1909 - 1917) 1911    abc
B (1921 - 1930) 1923    def
C (1932 - 1941) 1935    ghi

I manage to create something close to it by:

year_gr = pd.cut(df.year, 10, labels=[
   'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])
df['year_gr'] = year_gr

df.head()

year_gr year    other_cols
A       1911    abc
B       1923    def
C       1935    ghi

But how do I concatenate the rages generated automatically by pd.cut to my year_gr variable? I saw we can add the parameter retbins=True in the cut command to extract the bins, but I didn't manage to make use of it...

Thanks!


Solution

  • pd.cut results in a categorie object populated by interval objects. We use their .left and .right attributes to create the specified strings.

        import numpy as np, pandas as pd
        import string
    
        # test data:
        df=pd.DataFrame({"year":[1911,1923,1935,1911],"other_cols":["abc","def","ghi","jkl"]})
      Out:
               year other_cols
        0  1911        abc
        1  1923        def
        2  1935        ghi
        3  1911        jkl
    
        #create the intervals:
        cats=pd.cut(df.year,10)
    
        Out: cats.dtypes.categories
        IntervalIndex([(1910.976, 1913.4], (1913.4, 1915.8], (1915.8, 1918.2],...
    
        # char generator:
        gchar=(ch for ch in string.ascii_uppercase)
        dlbls= { iv:next(gchar) for iv in cats.dtypes.categories } #EDIT1
        # get the intervals and convert them to the specified strings:
        df["year_gr"]=[ f"{dlbls[iv]} ({int(np.round(iv.left))} - {int(np.round(iv.right))})" for iv in cats ] #EDIT1
       Out:
              year other_cols          year_gr
        0  1911        abc  A (1911 - 1913)
        1  1923        def  B (1921 - 1923)
        2  1935        ghi  C (1933 - 1935)
        3  1911        jkl  A (1911 - 1913)
    
        # align the columns:
        df= df.reindex(["year_gr","year","other_cols"], axis=1)
       Out:
                   year_gr  year other_cols
        0  A (1911 - 1913)  1911        abc
        1  B (1921 - 1923)  1923        def
        2  C (1933 - 1935)  1935        ghi
        3  A (1911 - 1913)  1911        jkl