python, pandas, machine-learning, data-analysis

Categorize data in a DataFrame column


I have a column of numbers in my dataframe, and I want to categorize these numbers into, e.g., high, low, excluded. How do I accomplish that? I am clueless; I have tried looking at the cut function and the category datatype.


Solution

  • A short example with pd.cut.

    Let's start with some data frame:

    import pandas as pd

    df = pd.DataFrame({'A': [0, 8, 2, 5, 9, 15, 1]})
    

    and, say, we want to assign the numbers to the following categories: 'low' if a number is in the interval [0, 2], 'mid' for (2, 8], 'high' for (8, 10], and we exclude numbers above 10 (or below 0).

    Thus, we have 3 bins with edges: 0, 2, 8, 10. Now, we can use cut as follows:

    pd.cut(df['A'], bins=[0, 2, 8, 10], include_lowest=True)
    Out[33]: 
    0     [0, 2]
    1     (2, 8]
    2     [0, 2]
    3     (2, 8]
    4    (8, 10]
    5        NaN
    6     [0, 2]
    Name: A, dtype: category
    Categories (3, object): [[0, 2] < (2, 8] < (8, 10]]
    

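    The argument include_lowest=True includes the left end of the first interval. (If you want intervals closed on the left and open on the right, use right=False.) Newer pandas versions may display the first interval as (-0.001, 2.0] rather than [0, 2]; the assignment of numbers to bins is the same.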

    The intervals are probably not the best names for the categories. So, let's use names: low/mid/high:

    pd.cut(df['A'], bins=[0, 2, 8, 10], include_lowest=True, labels=['low', 'mid', 'high'])
    Out[34]: 
    0     low
    1     mid
    2     low
    3     mid
    4    high
    5     NaN
    6     low
    Name: A, dtype: category
    Categories (3, object): [low < mid < high]
    

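    The excluded number 15 gets the "category" NaN. If you prefer a more meaningful name, probably the simplest solution (there are other ways to deal with NaNs) is to add another bin and a category name. Note that the extra bin only catches values in (10, 1000]; numbers below 0 or above 1000 would still get NaN. For example: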

    pd.cut(df['A'], bins=[0, 2, 8, 10, 1000], include_lowest=True, labels=['low', 'mid', 'high', 'excluded'])
    Out[35]: 
    0         low
    1         mid
    2         low
    3         mid
    4        high
    5    excluded
    6         low
    Name: A, dtype: category
    Categories (4, object): [low < mid < high < excluded]
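
One of the other ways to deal with the NaNs mentioned above is to keep the three original bins and relabel everything that falls outside them afterwards. The following is only a minimal sketch of that idea, not necessarily the best approach: the extra value -3 and the column name 'category' are assumptions added for illustration and are not part of the original example.

```python
import pandas as pd

# Same sample data as above, plus a negative value (an added assumption)
# to show that numbers below 0 are handled as well.
df = pd.DataFrame({'A': [0, 8, 2, 5, 9, 15, 1, -3]})

# Bin into the three named categories; out-of-range values become NaN.
binned = pd.cut(df['A'], bins=[0, 2, 8, 10], include_lowest=True,
                labels=['low', 'mid', 'high'])

# 'excluded' has to be registered as a category before it can be
# used as a fill value for the NaN entries.
df['category'] = binned.cat.add_categories('excluded').fillna('excluded')

print(df)
```

Compared with adding a wide extra bin, this labels values on both sides of the range (here 15 and -3) as 'excluded' without having to guess an upper edge such as 1000.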