Search code examples
python-3.xpandasnumpymapping

How to convert an alphanumberic column (object dtype) to int?


I have a dataframe (df) with 5 columns. 4 of the columns are dtype: object, and one is dtype: int. For simplicity let's say Columns 1-4 are objects and Column 5 is int dtype. I'm interested in converting Column 1 from an object dtype to an integer. It has a format of randomly created alphanumeric combinations like 0000000-1111111-aaaaaaaaa-bbbbbbb to zzzz99-abc1234-jfkslnfnsl120-204875987, with a total of 5000 unique values

Here is what I have tried so far. I've tried straight datatype conversions like

df.column1.astype('int') 

df.column1..astype(theano.config.floatX) 

But I get errors about how the conversion isn't possible that way.

I've also tried creating a new column and mapping integer values for each unique value in Column1 to use as a work-around, but I haven't had any luck. The code looked something like this:

np_arange = np.arange(1, 5000, 1)
df.int_column = df.column1.map(np_arange)

or

num_range = range(1, 5000, 1)
df.int_column = df.column1.map(num_range)

Here I get errors saying that the numpy arrays aren't callable, but I can't think of any other way to get around this. Does anyone have any ideas for how I could complete this?

Edit: The dataframe looks something like this (except more columns and rows):

df = pd.DataFrame({
     'Column1': ['00000-aaaa-1111-bbbbn', 'zzzz-1820-2222-vvvv', '4124-ce69-11f5-0293'],
     'Column2': [76, 25, 89],
     'Column3': ['MW', 'NA', 'BL'],
     'Column4': ['Car', 'Truck', 'Bike'],
     'Column5': ['OH', 'WE', 'SC']
})

And I need either another column where for every '0000-aaaa-1111-bbbb' value, there is a corresponding 1 value in the new column, and for every 'zzzz-1820-2222-vvvv' value in the column 1 for there to be a 2 in the new column - or a way to convert the alphanumeric combinations to integer


Solution

  • Combine select_dtypes and factorize:

    df.update(df.select_dtypes(exclude='number').apply(lambda s: pd.factorize(s)[0]+1))
    

    Output:

    
      Column1  Column2 Column3 Column4 Column5
    0       1       76       1       1       1
    1       2       25       2       2       2
    2       3       89       3       3       3