Search code examples
pythonpandasspss

How create number categories from sequential variable in python?


I've been locked out for days trying to create a new categorical variable from the aggregation of values ​​from a sequential variable. I'm migrating syntax scripts from SPSS to Python using Pandas In SPSS I used the following command to do what I need:

RECODE QTD_FTE_RAZAO (0=5) (1 thru 3000 = 0) (3000.01 thru 4000 = 1) (4000.01 thru 5000 = 2) (5000.01 thru 10000 = 3) (10000.01 thru 15000 = 4) (15000.01 thru 9999999999 = 5)
INTO COD_FTE_RAZAO.
EXECUTE.

I looked at the pandas cut function, but it doesn't work because it creates text objects and I honestly couldn't write the parameters I need.

I need to turn it into values ​​because I will create an index. I will create 5 variable categories and in the end I will add the values ​​and create an index. That's why I need to turn these ranges into numbers.

The easiest way to do this would be to do a function?

I'm completely lost, could someone help me please?


Solution

  • Using hand-generated data, pd.cut with explicit bins turns val into bucket categories.

    df = pd.DataFrame([{'val': 5}, {'val': 500},{'val': 5000},{'val': 2000},{'val': 3500},{'val': 6000},{'val': 12000},{'val': 500000}])
    df['vars'] = pd.cut(df['val'],bins=[1, 3000, 4000, 5000, 10000, 15000, float('inf')])
    df
    

    The result is the first two columns of the table below. If you want to turn them into integers of some sort, you can map the categories in index order with df['codes'] = df.vars.cat.codes.

          val                vars  codes
    0       5       (1.0, 3000.0]      0
    1     500       (1.0, 3000.0]      0
    2    5000    (4000.0, 5000.0]      2
    3    2000       (1.0, 3000.0]      0
    4    3500    (3000.0, 4000.0]      1
    5    6000   (5000.0, 10000.0]      3
    6   12000  (10000.0, 15000.0]      4
    7  500000      (15000.0, inf]      5
    

    As user political scientist commented, if you prefer not having an intermediate column and want whatever labels you desire that are not necessarily the integer indices of those bins, pass labels=[0, 1, 2, 3, 4, 5] (or a list containing the appropriate labels) to pd.cut.