python-3.x pandas dataframe categorical-data

pandas.DataFrame.astype is not converting dtype


I am trying to convert some columns from object to Categorical columns.

    # dtyp_cat = 'category'
    # mapper = {'Segment': dtyp_cat,
    #           'Sub-Category': dtyp_cat,
    #           'Postal Code': dtyp_cat,
    #           'Region': dtyp_cat,
    #          }

    df.astype({'Segment': 'category'})
    df.dtypes

But the output is still object type.

Dataset is hosted at:

    import pandas as pd

    url = r"https://raw.githubusercontent.com/jaegarbomb/TSF_GRIP/main/Retail_EDA/Superstore.csv"
    df = pd.read_csv(url)

Solution

  • astype returns a new DataFrame instead of modifying df in place, so assign the result back:

    df['Segment'] = df.Segment.astype('category')
    

    After this, df.info() shows:

    RangeIndex: 9994 entries, 0 to 9993
    Data columns (total 13 columns):
     #   Column        Non-Null Count  Dtype   
    ---  ------        --------------  -----   
     0   Ship Mode     9994 non-null   object  
     1   Segment       9994 non-null   category
     2   Country       9994 non-null   object  
     3   City          9994 non-null   object  
     4   State         9994 non-null   object  
     5   Postal Code   9994 non-null   int64   
     6   Region        9994 non-null   object  
     7   Category      9994 non-null   object  
     8   Sub-Category  9994 non-null   object  
     9   Sales         9994 non-null   float64 
     10  Quantity      9994 non-null   int64   
     11  Discount      9994 non-null   float64 
     12  Profit        9994 non-null   float64 
    dtypes: category(1), float64(3), int64(2), object(7)
    memory usage: 946.9+ KB
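
The same fix applies to the whole-frame call from the question. Below is a minimal sketch, assuming a small made-up DataFrame in place of the Superstore CSV (the values are hypothetical; the column names are taken from the question). It shows that calling astype without assignment leaves df untouched, and that assigning the result of a dict-based astype converts several columns at once.

```python
import pandas as pd

# Hypothetical stand-in for the Superstore data (values are made up).
df = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Consumer"],
    "Region": ["West", "East", "West"],
    "Sales": [10.0, 20.5, 7.25],
})

# astype returns a new DataFrame; without assignment the result is discarded.
df.astype({"Segment": "category"})
dtype_without_assignment = str(df["Segment"].dtype)  # not 'category'

# Assigning the result back keeps the conversion.
# A dict maps each column name to its target dtype.
mapper = {"Segment": "category", "Region": "category"}
df = df.astype(mapper)

print(dtype_without_assignment)
print(df.dtypes)
```

Columns not named in the mapper (here Sales) keep their original dtype.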
    
    

    EDIT

    If you want to convert several columns (in your case, I suppose, all the object columns), one way is to drop the columns that aren't objects, convert what's left, and then reattach the dropped columns.

    df2 = df.drop(['Postal Code', 'Sales', 'Quantity', 'Discount', 'Profit'], axis=1)
    df3 = df2.apply(lambda x: x.astype('category'))
    
    

    df3.info() then gives

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 9994 entries, 0 to 9993
    Data columns (total 8 columns):
     #   Column        Non-Null Count  Dtype   
    ---  ------        --------------  -----   
     0   Ship Mode     9994 non-null   category
     1   Segment       9994 non-null   category
     2   Country       9994 non-null   category
     3   City          9994 non-null   category
     4   State         9994 non-null   category
     5   Region        9994 non-null   category
     6   Category      9994 non-null   category
     7   Sub-Category  9994 non-null   category
    dtypes: category(8)
    memory usage: 115.2 KB
    

    I'll leave appending the other columns to you. A hint would be:

    df4 = pd.concat([df3, df], axis=1, sort=False)
    df_final = df4.loc[:,~df4.columns.duplicated()]
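
As an alternative to the drop/convert/concat route above (not the approach this answer takes), the non-numeric columns can be selected by dtype and converted in a single astype call, which also keeps the original column order. This is a minimal sketch, assuming a small made-up DataFrame with a few of the Superstore column names; the values and the helper name text_cols are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the Superstore data (values are made up).
df = pd.DataFrame({
    "Ship Mode": ["Second Class", "Standard Class", "Second Class"],
    "Segment": ["Consumer", "Corporate", "Consumer"],
    "Postal Code": [42420, 90036, 33311],
    "Sales": [261.96, 14.62, 957.58],
})

# Select every non-numeric column; in this sample those are the text columns.
text_cols = df.select_dtypes(exclude="number").columns

# Build a {column: dtype} mapping and assign the converted frame back.
df = df.astype({col: "category" for col in text_cols})

print(df.dtypes)
```

Note that exclude="number" would also pick up boolean or datetime columns if the data had any, so the selection may need narrowing for other datasets.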