Search code examples
pythonpandastype-conversiondecimal-point

python pandas - generic ways to deal with commas in string to float conversion with astype()


is there a generic way to tell pandas to use a comma (",") as decimal separators for type conversion from string to float etc.?

import pandas as pd
from datetime import datetime

data = {
    "col_str": ["a", "b", "c"],
    "col_int": ["1", "2", "3"],
    "col_float": ["1,2", "3,2342", "97837,8277"],
    "col_float2": ["13,2", "3234,2342", "263,8277"],
    "col_date": [datetime(2020, 8, 1, 0, 3, 4).isoformat(),
                 datetime(2020, 8, 2, 2, 4, 5).isoformat(),
                 datetime(2020, 8, 3, 6, 8, 4).isoformat()
                 ]
}

conversion_dict = {
    "col_str": str,
    "col_int": int,
    "col_float": float,
    "col_float2": float,
    "col_date": "datetime64"
}

df = pd.DataFrame(data=data)

print(df.dtypes)
df = df.astype(conversion_dict, errors="ignore")
print(df.dtypes)
print(df)

The above example returns object-columns for "col_float" and "col_float2" or throws an error is errors is set to "raise".

I would like to use the astype()-method directly without manually replacing commas with dots. The datasource generally returns floats with commas as decimal separator because the locale is set to german.

Is there a generic way to tell pandas as type that commas in floats - or any other numeric data type with decimals - are ok and should be converted automatically?

PS: I can´t use read_csv where you can specify the separator directly, because it´s a database.

Thanks in advance.


Solution

  • I fixed the problem with the following workaround. This still might break in some cases but I did not find a way to tell pands astype() that a comma is ok. If someone has another solution with pandas only, please let me know:

    import locale
    from datetime import datetime
    import pandas as pd
    
    data = {
        "col_str": ["a", "b", "c"],
        "col_int": ["1", "2", "3"],
        "col_float": ["1,2", "3,2342", "97837,8277"],
        "col_float2": ["13,2", "3234,2342", "263,8277"],
        "col_date": [datetime(2020, 8, 1, 0, 3, 4).isoformat(),
                     datetime(2020, 8, 2, 2, 4, 5).isoformat(),
                     datetime(2020, 8, 3, 6, 8, 4).isoformat()
                     ]
    }
    
    conversion_dict = {
        "col_str": str,
        "col_int": int,
        "col_float": float,
        "col_float2": float,
        "col_date": "datetime64"
    }
    
    df = pd.DataFrame(data=data)
    throw_error = True
    
    try:
        df = df.astype(conversion_dict, errors="raise")
    except ValueError as e:
        error_message = str(e).strip().upper()
        error_search = "COULD NOT CONVERT STRING TO FLOAT:"
        # compare error messages to only get the string to float error because pandas only throws ValueError´s which
        # are not datatype specific. This might be quite hacky because error messages could change.
        if error_message[:len(error_search)] == error_search:
            # convert everything else and ignore errors for the float columns
            df = df.astype(conversion_dict, errors="ignore")
            # go over the conversion dict
            for key, value in conversion_dict.items():
                # print(str(key) + ":" + str(value) + ":" + str(df[key].dtype))
                # only apply to convert-to-float-columns which are not already in the correct pandas type float64
                # if you don´t check for correctly classified types, .str.replace() throws an error
                if (value == float or value == "float") and df[key].dtype != "float64":
                    # df[key].apply(locale.atof) or anythin locale related is plattform dependant and therefore bad
                    # in my opinion
                    # locale settings for atof
                    # WINDOWS: locale.setlocale(locale.LC_ALL, 'deu_deu')
                    # UNIX: locale.setlocale(locale.LC_ALL, 'de_DE')
                    df[key] = pd.to_numeric(df[key].str.replace(',', '.'))
        else:
            if throw_error:
                # or do whatever is best suited for your use case
                raise ValueError(str(e))
            else:
                df = df.astype(conversion_dict, errors="ignore")
    
    print(df.dtypes)
    print(df)