Tags: python-3.x, pandas, hdf, feather

OverflowError while saving large Pandas df to hdf


I have a large Pandas dataframe (~15GB, 83m rows) that I am interested in saving as an h5 (or feather) file. One column contains long ID strings of numbers, which should have string/object type. But even when I ensure that pandas parses all columns as object:

df = pd.read_csv('data.csv', dtype=object)
print(df.dtypes)  # sanity check
df.to_hdf('df.h5', 'df')

> client_id                object
  event_id                 object
  account_id               object
  session_id               object
  event_timestamp          object
  # etc...

I get this error:

  File "foo.py", line 14, in <module>
    df.to_hdf('df.h5', 'df')
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/generic.py", line 1996, in to_hdf
    return pytables.to_hdf(path_or_buf, key, self, **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 279, in to_hdf
    f(store)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 273, in <lambda>
    f = lambda store: store.put(key, value, **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 890, in put
    self._write_to_group(key, value, append=append, **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 1367, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 2963, in write
    self.write_array('block%d_values' % i, blk.values, items=blk_items)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/pytables.py", line 2730, in write_array
    vlarr.append(value)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/tables/vlarray.py", line 547, in append
    self._append(nparr, nobjects)
  File "tables/hdf5extension.pyx", line 2032, in tables.hdf5extension.VLArray._append
OverflowError: value too large to convert to int

Apparently it is trying to convert this to an int anyway, and failing.

When running df.to_feather() I get a similar error:

df.to_feather('df.feather')
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/core/frame.py", line 1892, in to_feather
    to_feather(self, fname)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pandas/io/feather_format.py", line 83, in to_feather
    feather.write_dataframe(df, path)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/feather.py", line 182, in write_feather
    writer.write(df)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/feather.py", line 93, in write
    table = Table.from_pandas(df, preserve_index=False)
  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 501, in dataframe_to_arrays
    convert_fields))
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 487, in convert_column
    raise e
  File "/shared_directory/projects/env/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 481, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Could not convert 1542852887489 with type str: tried to convert to double', 'Conversion failed for column session_id with type object')

So:

  1. Is anything that looks like a number forcibly converted to a number in storage?
  2. Could the presence of NaNs affect what's happening here?
  3. Is there an alternative storage solution? What would be the best?

Solution

  • Having done some reading on this topic, it seems the issue is with string-type columns. My string columns contain a mixture of all-number strings and strings with characters. Pandas has the flexible option of keeping strings as object, without a declared type, but when serializing to hdf5 or feather the contents of the column are converted to a single type (str or double, say) and cannot be mixed. Both of these libraries fail when confronted with a sufficiently large column of mixed types.
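
    As a quick diagnostic (not part of the original answer; the helper name mixed_type_columns is hypothetical), you can check which object columns actually mix Python types before trying to serialize:

    import pandas as pd

    def mixed_type_columns(df: pd.DataFrame) -> list:
        """Return the object columns whose values span more than one Python type."""
        mixed = []
        for col in df.select_dtypes(include="object").columns:
            # map each value to its Python type and count the distinct types
            if df[col].map(type).nunique() > 1:
                mixed.append(col)
        return mixed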

    Force-converting my mixed column to strings allowed me to save it in feather, but in HDF5 the file ballooned and the process ended when I ran out of disk space.

    Here is an answer to a comparable case, where a commenter noted (two years ago) that "This problem is very standard, but solutions are few".

    Some background:

    String types in Pandas are called object, but this obscures the fact that they may be either pure strings or mixed dtypes (numpy has built-in string types, but Pandas never uses them for text). So the first thing to do in a case like this is to force every string column to string type (with df[col].astype(str)). But even so, with a large enough file (16GB, with long strings) this still failed. Why?
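
    A minimal sketch of that step (assuming df is the frame loaded above); note that astype(str) turns NaN into the literal string 'nan':

    # Force every object column to plain Python strings so that no column
    # mixes types when it is handed to HDF5/feather.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype(str)  # NaN becomes the string 'nan'

    df.to_feather('df.feather')  # feather succeeded after this; HDF5 still ballooned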

    The reason I was encountering this error was that my data contained long, high-entropy strings (many distinct unique values). (With low-entropy data, it might have been worthwhile to switch to the categorical dtype.) In my case, I realised that I only needed these strings in order to identify rows, so I could replace them with unique integers!

    df[col] = df[col].map(dict(zip(df[col].unique(), range(df[col].nunique()))))
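
    An equivalent sketch using pd.factorize (not what I used above, just an alternative way to produce the same kind of integer codes); the commented line shows the categorical-dtype option mentioned earlier:

    # codes run 0..n_unique-1 in order of first appearance (NaN gets -1);
    # uniques keeps the original strings if the mapping must be reversed later
    codes, uniques = pd.factorize(df[col])
    df[col] = codes

    # for low-entropy columns, a categorical dtype may be enough on its own:
    # df[col] = df[col].astype('category')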
    

    Other Solutions:

    For text data, there are recommended alternatives to hdf5/feather, including:

    • json
    • msgpack (note that in Pandas 0.25 read_msgpack is deprecated)
    • pickle (which has known security issues, so be careful - but it should be OK for internal storage/transfer of dataframes)
    • parquet, part of the Apache Arrow ecosystem (a minimal usage sketch follows this list).
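
    A minimal parquet sketch (assuming pyarrow is installed; like feather, this converts through Arrow, so the string-enforcement step above still applies):

    # write and read back with the pyarrow engine
    df.to_parquet('df.parquet', engine='pyarrow')
    df2 = pd.read_parquet('df.parquet', engine='pyarrow')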

    Here is an answer from Matthew Rocklin (one of the dask developers) comparing msgpack and pickle. He wrote a broader comparison on his blog.