
Pandas read_csv dtype read all columns but few as string


I'm using Pandas to read a bunch of CSVs. I'm passing a dict to the dtype parameter to tell pandas which columns to read as strings instead of the default:

dtype_dic = {'service_id': str, 'end_date': str, ...}
feedArray = pd.read_csv(feedfile, dtype=dtype_dic)

In my scenario, all the columns except a few specific ones are to be read as strings. So instead of defining many columns as str in dtype_dic, I'd like to set just my chosen few as int or float. Is there a way to do that?

It's a loop cycling through various CSVs with differing columns, so a direct column conversion after reading the whole csv as strings (dtype=str) would not be easy, since I wouldn't immediately know which columns a given csv has. (I'd rather spend that effort defining all the columns in the dtype dict!)

Edit: If there's a way to process the list of column names to be converted to numbers without erroring out when a column isn't present in a given csv, then yes, that would be a valid solution, assuming there's no way to do this at the csv-reading stage itself.
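Something like this is what I mean (just a sketch; num_cols and the column names are made up):

num_cols = ['qty', 'price']  # hypothetical numeric columns
for col in num_cols:
    if col in feedArray.columns:  # skip columns this csv doesn't have
        feedArray[col] = pd.to_numeric(feedArray[col])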

Note: this sounds like a previously asked question, but the answers there went down a very different path (bool-related) which doesn't apply to this question. Please don't mark this as a duplicate!


Solution

  • For Pandas 1.5.0+, there's an easy way to do this. If you use a defaultdict instead of a normal dict for the dtype argument, any columns which aren't explicitly listed in the dictionary will use the default as their type. E.g.

    from collections import defaultdict
    import pandas as pd

    # every column defaults to str; A and B are overridden below
    types = defaultdict(lambda: str, A=int, B=float)
    df = pd.read_csv("/path/to/file.csv", dtype=types, keep_default_na=False)
    

    (I haven't tested this, but I assume you still need keep_default_na=False)
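    A self-contained demo of the defaultdict behavior (pandas 1.5+ assumed; the inline CSV is invented for illustration):

    from collections import defaultdict
    from io import StringIO
    import pandas as pd

    raw = StringIO('A,B,C,D\n1,2.5,x,\n3,4.5,y,NA\n')
    types = defaultdict(lambda: str, A=int, B=float)
    df = pd.read_csv(raw, dtype=types, keep_default_na=False)
    print(df.dtypes)         # A int64, B float64, C object, D object
    print(df['D'].tolist())  # ['', 'NA'] -- kept as strings, not NaN

    Columns C and D never appear in the dict, so they fall back to the str default, and keep_default_na=False keeps the blank and 'NA' cells as literal strings.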


    For older versions of Pandas:

    You can read the entire csv as strings then convert your desired columns to other types afterwards like this:

    import pandas as pd

    df = pd.read_csv('/path/to/file.csv', dtype=str, keep_default_na=False)
    # for illustration, the df read above might look like this:
    # df = pd.DataFrame({'A': ['1', '3', '5'], 'B': ['2', '4', '6'], 'C': ['x', 'y', 'z']})
    types_dict = {'A': int, 'B': float}
    for col, col_type in types_dict.items():
        df[col] = df[col].astype(col_type)
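
    Since your CSVs have differing columns, a small variation on that loop skips any column that isn't present (types_dict as defined above):

    for col, col_type in types_dict.items():
        if col in df.columns:  # ignore columns this csv doesn't have
            df[col] = df[col].astype(col_type)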
    

    keep_default_na=False is necessary if some of your columns contain empty strings or values like 'NA', which pandas converts to NaN (of type float) by default; without it you would end up with mixed str/float columns.
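
    A quick illustration of that pitfall (the inline CSV is invented for the demo):

    from io import StringIO
    import pandas as pd

    raw = 'A,B\n1,\n2,NA\n'
    df = pd.read_csv(StringIO(raw), dtype=str)  # default NA handling
    print(df['B'].tolist())  # [nan, nan] -- floats, despite dtype=str
    df = pd.read_csv(StringIO(raw), dtype=str, keep_default_na=False)
    print(df['B'].tolist())  # ['', 'NA'] -- real strings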

    Another approach, if you really want to specify the proper types for all columns when reading the file in, rather than converting them afterwards: read in just the column names (no rows), then use those to fill in which columns should be strings:

    import pandas as pd

    # read only the header row to discover this csv's columns
    col_names = pd.read_csv('file.csv', nrows=0).columns
    types_dict = {'A': int, 'B': float}
    types_dict.update({col: str for col in col_names if col not in types_dict})
    df = pd.read_csv('file.csv', dtype=types_dict)
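
    Because the per-file types_dict is built only from the columns actually present in that file, this also handles your loop over differing CSVs without erroring on missing columns. A sketch of that loop (file names and numeric columns are placeholders):

    import pandas as pd

    numeric_types = {'A': int, 'B': float}  # your chosen non-string columns

    for feedfile in ['feed1.csv', 'feed2.csv']:  # placeholder paths
        col_names = pd.read_csv(feedfile, nrows=0).columns
        types_dict = {col: numeric_types.get(col, str) for col in col_names}
        feedArray = pd.read_csv(feedfile, dtype=types_dict, keep_default_na=False)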