I have a dataframe (df) with 5 columns. 4 of the columns are dtype: object, and one is dtype: int. For simplicity let's say Columns 1-4 are objects and Column 5 is int dtype. I'm interested in converting Column 1 from an object dtype to an integer. It has a format of randomly created alphanumeric combinations like 0000000-1111111-aaaaaaaaa-bbbbbbb to zzzz99-abc1234-jfkslnfnsl120-204875987, with a total of 5000 unique values
Here is what I have tried so far. I've tried straight datatype conversions like
df.column1.astype('int')
df.column1..astype(theano.config.floatX)
But I get errors about how the conversion isn't possible that way.
I've also tried creating a new column and mapping integer values for each unique value in Column1 to use as a work-around, but I haven't had any luck. The code looked something like this:
np_arange = np.arange(1, 5000, 1)
df.int_column = df.column1.map(np_arange)
or
num_range = range(1, 5000, 1)
df.int_column = df.column1.map(num_range)
Here I get errors saying that the numpy arrays aren't callable, but I can't think of any other way to get around this. Does anyone have any ideas for how I could complete this?
Edit: The dataframe looks something like this (except more columns and rows):
df = pd.DataFrame({
'Column1': ['00000-aaaa-1111-bbbbn', 'zzzz-1820-2222-vvvv', '4124-ce69-11f5-0293'],
'Column2': [76, 25, 89],
'Column3': ['MW', 'NA', 'BL'],
'Column4': ['Car', 'Truck', 'Bike'],
'Column5': ['OH', 'WE', 'SC']
})
And I need either another column where for every '0000-aaaa-1111-bbbb' value, there is a corresponding 1 value in the new column, and for every 'zzzz-1820-2222-vvvv' value in the column 1 for there to be a 2 in the new column - or a way to convert the alphanumeric combinations to integer
Combine select_dtypes
and factorize
:
df.update(df.select_dtypes(exclude='number').apply(lambda s: pd.factorize(s)[0]+1))
Output:
Column1 Column2 Column3 Column4 Column5
0 1 76 1 1 1
1 2 25 2 2 2
2 3 89 3 3 3