I've been locked out for days trying to create a new categorical variable from the aggregation of values from a sequential variable. I'm migrating syntax scripts from SPSS to Python using Pandas In SPSS I used the following command to do what I need:
RECODE QTD_FTE_RAZAO (0=5) (1 thru 3000 = 0) (3000.01 thru 4000 = 1) (4000.01 thru 5000 = 2) (5000.01 thru 10000 = 3) (10000.01 thru 15000 = 4) (15000.01 thru 9999999999 = 5)
INTO COD_FTE_RAZAO.
EXECUTE.
I looked at the pandas cut function, but it doesn't work because it creates text objects and I honestly couldn't write the parameters I need.
I need to turn it into values because I will create an index. I will create 5 variable categories and in the end I will add the values and create an index. That's why I need to turn these ranges into numbers.
The easiest way to do this would be to do a function?
I'm completely lost, could someone help me please?
Using hand-generated data, pd.cut
with explicit bins turns val
into bucket categories.
df = pd.DataFrame([{'val': 5}, {'val': 500},{'val': 5000},{'val': 2000},{'val': 3500},{'val': 6000},{'val': 12000},{'val': 500000}])
df['vars'] = pd.cut(df['val'],bins=[1, 3000, 4000, 5000, 10000, 15000, float('inf')])
df
The result is the first two columns of the table below. If you want to turn them into integers of some sort, you can map the categories in index order with df['codes'] = df.vars.cat.codes
.
val vars codes
0 5 (1.0, 3000.0] 0
1 500 (1.0, 3000.0] 0
2 5000 (4000.0, 5000.0] 2
3 2000 (1.0, 3000.0] 0
4 3500 (3000.0, 4000.0] 1
5 6000 (5000.0, 10000.0] 3
6 12000 (10000.0, 15000.0] 4
7 500000 (15000.0, inf] 5
As user political scientist commented, if you prefer not having an intermediate column and want whatever labels you desire that are not necessarily the integer indices of those bins, pass labels=[0, 1, 2, 3, 4, 5]
(or a list containing the appropriate labels) to pd.cut
.