python pandas dataframe categorical-data

Factorize a column of strings in pandas


I have a large data frame, df_original, which looks like this:

        ID    Count   Column 2   Column 3  Column 4
RowX    1      234.     255.       yes.      452
RowY    1      123.     135.       no.       342
RowW    1      234.     235.       yes.      645
RowJ    1      123.     115.       no.       342
RowA    1      234.     285.       yes.      233
RowR    1      123.     165.       no.       342
RowX    2      234.     255.       yes.      234
RowY    2      123.     135.       yes.      342
RowW    2      234.     235.       yes.      233
RowJ    2      123.     115.       yes.      342
RowA    2      234.     285.       yes.      312
RowR    2      123.     165.       no.       342
.
.
.
RowX    1233   234.     255.       yes.      133
RowY    1233   123.     135.       no.       342
RowW    1233   234.     235.       no.       253
RowJ    1233   123.     115.       yes.      342
RowA    1233   234.     285.       yes.      645
RowR    1233   123.     165.       no.       342

I want to replace the text data with predefined numerical equivalents. For example, I'd like to replace the yes and no values in Column 3 with 1 and 0 respectively. Is there a way to do this without manually editing the values?


Solution

  • Let series be Column 3 (first twelve rows shown):
    
    RowX    yes
    RowY     no
    RowW    yes
    RowJ     no
    RowA    yes
    RowR     no
    RowX    yes
    RowY    yes
    RowW    yes
    RowJ    yes
    RowA    yes
    RowR     no
    Name: Column 3, dtype: object
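
To follow along, the sample series can be rebuilt by hand. This is only a sketch: the values are copied from the twelve rows printed above, and the names index and values are introduced here for illustration.

```python
import pandas as pd

# Hypothetical reconstruction of the twelve sample rows of "Column 3"
index = ["RowX", "RowY", "RowW", "RowJ", "RowA", "RowR"] * 2
values = ["yes", "no", "yes", "no", "yes", "no",
          "yes", "yes", "yes", "yes", "yes", "no"]

series = pd.Series(values, index=index, name="Column 3")
print(series)
```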
    

    pd.factorize

    # factorize() codes values in order of first appearance ('yes' -> 0, 'no' -> 1 here)
    1 - series.factorize()[0]
    array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
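
One caveat with the factorize approach is that the codes depend on which value appears first. A minimal sketch, using a made-up series (flipped) that starts with "no", shows the difference and one possible way to avoid relying on row order:

```python
import pandas as pd

# Hypothetical series whose first value is "no" rather than "yes"
flipped = pd.Series(["no", "yes", "no"])

codes, uniques = pd.factorize(flipped)

# "no" appears first, so it receives code 0 and 1 - codes marks "no" as 1
naive = 1 - codes

# Look up the code that factorize gave to "yes" and compare against it
yes_code = uniques.get_loc("yes")
robust = (codes == yes_code).astype(int)

print(naive)
print(robust)
```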
        
    

    np.where

    np.where(series == 'yes', 1, 0)
    array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0])
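
np.where returns a plain NumPy array, so the row labels are dropped. Assigning the result back to the column keeps the data frame's index. A small sketch, assuming a three-row stand-in for df_original:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for df_original
df = pd.DataFrame(
    {"Column 3": ["yes", "no", "yes"]},
    index=["RowX", "RowY", "RowW"],
)

# Overwrite the text column with 1 where the value is "yes", else 0
df["Column 3"] = np.where(df["Column 3"] == "yes", 1, 0)
print(df)
```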
    

    pd.Categorical/astype('category')

    # categories are sorted, so 'no' -> 0 and 'yes' -> 1
    pd.Categorical(series).codes
    array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0], dtype=int8)
    
    series.astype('category').cat.codes
    
    RowX    1
    RowY    0
    RowW    1
    RowJ    0
    RowA    1
    RowR    0
    RowX    1
    RowY    1
    RowW    1
    RowJ    1
    RowA    1
    RowR    0
    dtype: int8
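
The codes above come out as no -> 0 and yes -> 1 because the inferred categories are sorted. If the mapping should not depend on sort order, the categories can be listed explicitly. A sketch with made-up values; note that anything outside the listed categories gets the code -1:

```python
import pandas as pd

series = pd.Series(["yes", "no", "yes"])

# Fix the order explicitly: "no" -> 0, "yes" -> 1
explicit_codes = pd.Categorical(series, categories=["no", "yes"]).codes

# A value that is not among the categories is coded as -1
with_other = pd.Categorical(["yes", "maybe"], categories=["no", "yes"]).codes

print(explicit_codes)
print(with_other)
```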
    

    pd.Series.replace

    series.replace({'yes' : 1, 'no' : 0})
     
    RowX    1
    RowY    0
    RowW    1
    RowJ    0
    RowA    1
    RowR    0
    RowX    1
    RowY    1
    RowW    1
    RowJ    1
    RowA    1
    RowR    0
    Name: Column 3, dtype: int64
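
A closely related option, not part of the original answer, is Series.map with the same dictionary. Unlike replace, map turns any value missing from the dictionary into NaN, which can help to spot unexpected entries. A minimal sketch with made-up data:

```python
import pandas as pd

mapping = {"yes": 1, "no": 0}

series = pd.Series(["yes", "no", "yes"], index=["RowX", "RowY", "RowW"])
mapped = series.map(mapping)

# "maybe" has no entry in the mapping, so it becomes NaN
unexpected = pd.Series(["yes", "maybe"]).map(mapping)

print(mapped)
print(unexpected)
```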
    

    A fun, generalised version of the above:

    series.replace({r'^(?!yes).*$' : 0}, regex=True).astype(bool).astype(int)
    
    RowX    1
    RowY    0
    RowW    1
    RowJ    0
    RowA    1
    RowR    0
    RowX    1
    RowY    1
    RowW    1
    RowJ    1
    RowA    1
    RowR    0
    Name: Column 3, dtype: int64
    

    Anything that does not start with "yes" becomes 0; everything else becomes 1.
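
Because the pattern is anchored with a negative lookahead, a value such as "yesterday" still counts as 1. The same rule can be written without a regex, and an exact comparison is available if only the literal "yes" should count. A sketch with made-up values:

```python
import pandas as pd

series = pd.Series(["yes", "no", "yesterday", "maybe"])

# Same rule as the regex: 1 if the string starts with "yes", else 0
starts_with_yes = series.str.startswith("yes").astype(int)

# Stricter rule: 1 only for the exact string "yes"
exactly_yes = series.eq("yes").astype(int)

print(starts_with_yes.tolist())
print(exactly_yes.tolist())
```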