python pandas categorical-data

Pandas category dtypes ignored in read_csv()


I have a strange issue when loading a CSV file in pandas (version 1.0.3).

I want to convert some columns to category automatically. To do this, I created a dictionary that maps column names to their types. For one column it works, but for the others it does not, and I don't get any error.
What could cause a column not to be parsed as a category? Strangely, if I cast that column to category afterwards, the operation works perfectly, so at first glance it doesn't seem to be a problem with the column's data.

import pandas as pd

# Requested dtypes, keyed by column name
col_types = {
    'CURRENCY': "category",
    'PRODUCT': "category",
    'PRODUCT_TYPE': "category",
}

def parse_csv(path_location):
    df = pd.read_csv(
        path_location,
        sep=';',
        engine='c',
        dtype=col_types,
        true_values=['Y', 'y'],
        false_values=['N', 'n'],
        converters=converters,  # defined elsewhere, not shown here
        usecols=['PRODUCT', 'PRODUCT_TYPE', 'PORTFOLIO_CURRENCY', 'NATIONALITY'],
        nrows=99,
    )
    return df

The result I get from the function above is:

Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   PORTFOLIO_CURRENCY  198 non-null    category
 1   PRODUCT             198 non-null    object  
 2   PRODUCT_TYPE        198 non-null    object  
 3   AGE                 185 non-null    float64 
 4   NATIONALITY         198 non-null    object  
dtypes: category(1), float64(1), object(3)

Solution

  • Although I can't install 1.0.3 to check whether the version is the problem, I tested it on 1.1.4 and it works as expected. Please update pandas to the newest version, as there were a lot of fixes for categoricals in v1.1.0.

    If that doesn't help, check the provided converters and verify that the CSV doesn't contain malformed data, such as invalid Unicode, although I wouldn't expect problems of this kind.
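One thing worth checking in the converters: the pandas documentation for read_csv states that when a converter is specified for a column, it is applied instead of the dtype conversion, so a "category" entry for that same column has no effect. The sketch below illustrates this and then casts afterwards as a fallback. The sample CSV contents and the str.strip converter are assumptions for illustration, not the asker's actual data or converters.

```python
import io
import warnings

import pandas as pd

# Hypothetical sample data; only the column names come from the question.
CSV_TEXT = "PRODUCT;PRODUCT_TYPE\n A;x\nB ;y\n A;x\n"
COL_TYPES = {"PRODUCT": "category", "PRODUCT_TYPE": "category"}


def load(converters=None):
    """Read the sample CSV with the requested dtypes and optional converters."""
    with warnings.catch_warnings():
        # pandas warns when a column has both a converter and a dtype;
        # the warning is silenced here only to keep the output short.
        warnings.simplefilter("ignore")
        return pd.read_csv(
            io.StringIO(CSV_TEXT),
            sep=";",
            dtype=COL_TYPES,
            converters=converters,
        )


plain = load()
with_converter = load(converters={"PRODUCT": str.strip})  # assumed converter

print(plain.dtypes)           # no converters: both columns follow COL_TYPES
print(with_converter.dtypes)  # PRODUCT is handled by the converter instead

# Fallback: cast the requested columns after loading.
fixed = with_converter.astype(
    {col: "category" for col in COL_TYPES if col in with_converter.columns}
)
print(fixed.dtypes)
```

If the dtypes differ between the two loads on your data as well, either drop the converter for the affected columns or keep it and cast with astype afterwards, as in the last step.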