Search code examples
pythonpandasparquetfeather

What format to export pandas dataframe while retaining data types? Not CSV; Sqlite? Parquet?


My workflow typically involves loading some data, typically from CSV files, into a pandas dataframe, cleansing it, defining what the right data type for each column is, then exporting it to a SQL server.

For those situations when a SQL server is not available, what are good alternatives to store the cleansed data and the explicit definition of the data type for each column?

  • The only real solution I have tested is to export to a sqlite .db file, using the answer here to make sure dates are read as dates.
  • How about Feather, HDF5, Parquet? Pandas supports them but I don't know much about these formats. I have read feather is not recommended for long-term storage (because the API may change? Not clear)

  • I am not sure about using pickle: I understand it's not a secure format, and the API keeps changing and breaking backwards compatibility

  • CSV is not really an option because inferring data types on my data is often a nightmare; when reading the data back into pandas, I'd need to explicitly declare the formats, including the date format, otherwise:

    • pandas can create columns where one row is dd-mm-yyyy and another row is mm-dd-yyyy (see here). Plus
    • I have many text columns where the first 10k rows seem to be numbers, and the next 100 are text, so most software will infer the column is numeric, then fail on the import. Maybe I'd need to create a function which exports an ancillary file with all the data type definitions, date formats, etc? Feasible but cumbersome.

UPDATE: This is an interesting comparison, according to which HDF5 was the fastest format: https://medium.com/@bobhaffner/gist-to-medium-test-db3d51b8ba7b

I seem to understand that another difference between HDF5 and Parquet is that datetime64 has no direct equivalent in Hdf5. Most people seem to store their dates in HDF5 as ISO-date-formatted (yyyy-mm-dd) strings.


Solution

  • If you really want to avoid pickle and saving a CSV (I don't fully agree with your statements about those not being feasible options) then you could run a local database server to save the data in and do a dump/restore process when the SQL server is available again. Otherwise:

    Use to_pickle method of the DataFrame object.

    Or, save a data type json file with your data types and specify your date format when saving the CSV:

    # export
    import json
    data_types = df.dtypes.astype(str).to_dict()
    with open('data_type_key.json', 'w') as f
        json.dump(data_types, f)
    df.to_csv('data.csv', date_format='%Y%m%d')
    
    # import
    data_types = json.loads('data_type_key.json')
    data_frame = pd.read_csv(your_csv_path, dtype=data_types)