I have a Pandas DataFrame with multiple columns. One of them is the year
. From this column I want to create a new one with categorical values (I guess the therm is buckets), having the buckets auto-generated. It should result in something like that:
year_gr year other_cols
A (1909 - 1917) 1911 abc
B (1921 - 1930) 1923 def
C (1932 - 1941) 1935 ghi
I manage to create something close to it by:
year_gr = pd.cut(df.year, 10, labels=[
'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])
df['year_gr'] = year_gr
df.head()
year_gr year other_cols
A 1911 abc
B 1923 def
C 1935 ghi
But how do I concatenate the rages generated automatically by pd.cut
to my year_gr
variable?
I saw we can add the parameter retbins=True
in the cut
command to extract the bins, but I didn't manage to make use of it...
Thanks!
pd.cut results in a categorie object populated by interval objects. We use their .left and .right attributes to create the specified strings.
import numpy as np, pandas as pd
import string
# test data:
df=pd.DataFrame({"year":[1911,1923,1935,1911],"other_cols":["abc","def","ghi","jkl"]})
Out:
year other_cols
0 1911 abc
1 1923 def
2 1935 ghi
3 1911 jkl
#create the intervals:
cats=pd.cut(df.year,10)
Out: cats.dtypes.categories
IntervalIndex([(1910.976, 1913.4], (1913.4, 1915.8], (1915.8, 1918.2],...
# char generator:
gchar=(ch for ch in string.ascii_uppercase)
dlbls= { iv:next(gchar) for iv in cats.dtypes.categories } #EDIT1
# get the intervals and convert them to the specified strings:
df["year_gr"]=[ f"{dlbls[iv]} ({int(np.round(iv.left))} - {int(np.round(iv.right))})" for iv in cats ] #EDIT1
Out:
year other_cols year_gr
0 1911 abc A (1911 - 1913)
1 1923 def B (1921 - 1923)
2 1935 ghi C (1933 - 1935)
3 1911 jkl A (1911 - 1913)
# align the columns:
df= df.reindex(["year_gr","year","other_cols"], axis=1)
Out:
year_gr year other_cols
0 A (1911 - 1913) 1911 abc
1 B (1921 - 1923) 1923 def
2 C (1933 - 1935) 1935 ghi
3 A (1911 - 1913) 1911 jkl