python dataframe data-binding categorization

How to automatically categorise data in a pandas DataFrame?


I have a data frame with more than 1000 rows and 200 columns, something like this:

     my_data:
             ID,   f1,   f2, ..     ,f200   Target
             x1     3     0, ..     ,2      0
             x2     6     2, ..     ,1      1
             x3     5     4, ..     ,0      0
             x4     0     5, ..     ,18     1
             ..     .     ., ..     ,..     .
             xn     13    0, ..     ,4      0

First, I want to automatically discretize these features (f1-f200) into four groups: no, low, medium and high. IDs that have a zero in a column (e.g., x1 and xn in f2) should be labelled "no" for that column, and the remaining values should be categorized as low, medium or high.

I found this:

  pd.cut(my_data,3, labels=["low", "medium", "high"]) 

But this does not solve the problem. Any ideas?


Solution

  • You need to build the bins dynamically for each column and iterate over the columns. This can be done as follows:

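    import numpy as np
    import pandas as pd

    new_df = pd.DataFrame()
    for name, col in df1.items():  # df1 is your dataframe (numeric feature columns only); iteritems() was removed in pandas 2.0
        # Bin edges must increase strictly, so this assumes min + 1 < mean < max in every column.
        bins = [-np.inf, 0, col.min() + 1, col.mean(), col.max()]
        new_df[name] = pd.cut(col, bins=bins, include_lowest=False, labels=['no', 'low', 'mid', 'high'])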
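The fixed edges above make pd.cut raise an error for any column where they are not strictly increasing, for example a sparse column whose mean is below min + 1. One possible alternative, shown here only as a sketch, is to label the zeros "no" first and then let pd.cut split the remaining positive values into three equal-width bins. The helper name discretize_column and the small sample frame are hypothetical, and the sketch assumes the features are non-negative numbers with no missing values.

```python
import pandas as pd

LABELS = ["no", "low", "medium", "high"]


def discretize_column(col: pd.Series) -> pd.Series:
    """Hypothetical helper: zeros become 'no', positive values get three equal-width bins."""
    out = pd.Series("no", index=col.index, dtype=object)
    mask = col > 0  # assumption: values are non-negative, so everything else is zero
    if mask.any():  # pd.cut cannot bin an empty selection
        binned = pd.cut(col[mask], bins=3, labels=LABELS[1:])
        out.loc[mask] = binned.astype(object).to_numpy()
    # Return an ordered categorical so that 'no' < 'low' < 'medium' < 'high'.
    return pd.Series(
        pd.Categorical(out, categories=LABELS, ordered=True),
        index=col.index,
        name=col.name,
    )


# Small made-up frame shaped like the one in the question.
my_data = pd.DataFrame({
    "ID": ["x1", "x2", "x3", "x4"],
    "f1": [3, 6, 5, 0],
    "f2": [0, 2, 4, 5],
    "Target": [0, 1, 0, 1],
})

# Leave ID and Target untouched and discretize only the feature columns.
feature_cols = [c for c in my_data.columns if c not in ("ID", "Target")]
result = my_data.copy()
for c in feature_cols:
    result[c] = discretize_column(my_data[c])
```

Because the bins are computed per column from the positive values only, "low", "medium" and "high" are relative to each feature's own range. If groups of roughly equal size are preferred over equal-width bins, pd.qcut could be used in place of pd.cut, although it can fail when a column has many tied values.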