
How to make pandas to use nulls for int64 column when reading CSV file


I'm trying to read an old DBase file exported to CSV, and some columns are just empty. At first I had a problem with integer columns being converted to float, but thanks to @Nathan's answer here Pandas read_csv dtype read all columns but few as string that problem was resolved. After that I had the right column types, using the code below:

import numpy as np
import pandas as pd


def read_csv(file_name):
    # Infer a dtype for each column from the DBase field descriptor embedded
    # in the header, e.g. "FOR_WHAT,N,5,0" -> name, type, width, decimals.
    inferred_types = {}
    columns = pd.read_csv(
        file_name + '.csv', header=0, index_col=1, squeeze=True,
        keep_default_na=False, nrows=0
    ).columns
    for col in columns:
        col_type = col.split(',')
        try:
            if len(col_type) < 2:
                inferred_types[col] = str
            elif col_type[1] == 'C':  # character field
                inferred_types[col] = str
            elif col_type[1] == 'N':  # numeric field
                if 'EGN' in col:  # special case
                    inferred_types[col] = str
                elif col_type[3] == '0':  # zero decimal places -> integer
                    inferred_types[col] = np.int64
                else:
                    inferred_types[col] = np.float64
            else:
                inferred_types[col] = str
        except Exception as e:
            print(f'{file_name} {col} -> {e}')

    df = pd.read_csv(
        file_name + '.csv', header=0, index_col=1, squeeze=True,
        keep_default_na=False, dtype=inferred_types
    )
    return df

I got a ValueError when pandas hit an empty cell in a column.

What am I missing here?

EDIT: Here are the first few rows of one of the problematic files:

,"CODE_CURR,N,2,0","CURRENCIES,C,20","CUOZN,C,3","FOR_WHAT,N,5,0","CURS_DT,N,13,7","DATE_VAL,C,8","DATE_ACT,C,8","TIME_ACT,C,8","ID_NUMBER,C,2","SUBS_CODE,C,7","USER_CODE,C,7"
0,1,австралийски долари,AUD,,46.665,,,,,,
1,2,австрийски шилинги,ATS,,5.758,,,,,,
2,3,белгийски франкове,BEF,,1.969,,,,,,

The problem is that the column "FOR_WHAT,N,5,0" (which should be integer) is entirely empty, hence the error: ValueError: invalid literal for int() with base 10: ''

EDIT2: I would highly appreciate any workaround! The data set is not that big, so performance is not an issue here.


Solution

  • There are two ways to use NULL values (in the database sense) in a pandas column containing integer values.

    1. the still official way: convert the column to float64 and use NaN for NULL values.

      The nice thing is that np.nan is well supported by most database adapters, so all NaN values should be automatically converted to NULL database values when you insert (or update) them in a database. The downside is that float64 cannot represent all integer values above 2**53 exactly (the IEEE 754 double-precision mantissa is only 53 bits, including the implicit leading bit).

    2. the experimental way: use the new pd.Int64Dtype

      This new type can hold any 64-bit integer value plus a special pd.NA value, so it provides exactly what you want. The downside here is that the documentation explicitly says:

      IntegerArray is currently experimental. Its API or implementation may change without warning.

      Long story short, it may or may not work for your use case (depending on support in the database adapter), and you may have to adapt your code if something changes in a later version.
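Option 1 can be sketched as follows, using a couple of the column names from the sample CSV in the question (a minimal sketch: the key point is mapping the would-be integer columns to np.float64 and keeping pandas' default NA handling, i.e. no keep_default_na=False, so empty fields parse as NaN):

```python
import io

import numpy as np
import pandas as pd

# Sample rows based on the question; the "FOR_WHAT,N,5,0" column is empty.
csv_data = '''\
,"CODE_CURR,N,2,0","FOR_WHAT,N,5,0","CURS_DT,N,13,7"
0,1,,46.665
1,2,,5.758
'''

# Mapping the integer-typed columns to float64 (instead of np.int64) lets
# pandas parse empty cells as NaN; default NA handling is used here, so ''
# is recognised as missing.
df = pd.read_csv(
    io.StringIO(csv_data), header=0, index_col=0,
    dtype={'CODE_CURR,N,2,0': np.float64,
           'FOR_WHAT,N,5,0': np.float64,
           'CURS_DT,N,13,7': np.float64},
)
print(df['FOR_WHAT,N,5,0'].isna().all())  # True
print(df['CODE_CURR,N,2,0'].tolist())     # [1.0, 2.0]
```

Note that the integer values come back as floats (1.0, 2.0), which is exactly the trade-off described above.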
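Option 2 can be sketched the same way (again a minimal sketch with column names from the sample CSV): passing the string 'Int64' as the dtype selects the nullable integer type, so values stay exact integers and empty cells become pd.NA:

```python
import io

import pandas as pd

# Same sample rows; the "FOR_WHAT,N,5,0" column is empty.
csv_data = '''\
,"CODE_CURR,N,2,0","FOR_WHAT,N,5,0"
0,1,
1,2,
'''

# The nullable 'Int64' dtype (capital I) keeps integers exact and stores
# pd.NA for the empty cells.
df = pd.read_csv(
    io.StringIO(csv_data), header=0, index_col=0,
    dtype={'CODE_CURR,N,2,0': 'Int64', 'FOR_WHAT,N,5,0': 'Int64'},
)
print(df.dtypes['FOR_WHAT,N,5,0'])        # Int64
print(df['FOR_WHAT,N,5,0'].isna().all())  # True
print(df['CODE_CURR,N,2,0'].tolist())     # [1, 2]
```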