Search code examples
pythonpandasdataframetimedelta

How to properly convert Pandas DataFrame which contains timedelta columns to/from JSON?


I've got an application where I'm trying to convert a Pandas DataFrame to and from a JSON object, and I'm running into an issue when the df contains a Timedelta object. I'm using Pandas 1.2.4.

Here's the sample df that I've been using:

>>> timedelta_df = pd.DataFrame({'datetime': pd.Series(['2013-12-31T00:00:00.000Z'], dtype='datetime64[ns]'),
                                 'days': pd.Series([pd.Timedelta(days=1)])})
>>> timedelta_df
    datetime   days
0 2013-12-31 1 days
>>> timedelta_df.dtypes
datetime     datetime64[ns]
days        timedelta64[ns]
dtype: object

I then have been using to_json and read_json to convert the df into JSON and back to a df:

>>> js_result = timedelta_df.to_json()
>>> js_result
'{"datetime":{"0":1388448000000},"days":{"0":86400000}}'
>>> result_df = pd.read_json(js_result)
>>> result_df
    datetime      days
0 2013-12-31  86400000
>>> result_df.dtypes
datetime    datetime64[ns]
days                 int64
dtype: object

And then to attempt to get the correct type again I've been using astype, which appears to be where I'm running into issues:

>>> result_df = result_df.astype(timedelta_df.dtypes.to_dict())
>>> result_df
    datetime                   days
0 2013-12-31 0 days 00:00:00.086400
>>> result_df.dtypes
datetime     datetime64[ns]
days        timedelta64[ns]
dtype: object

So I'm getting the correct type, but the value is incorrect.

Next I tried using the iso date format, but I get an error there instead:

>>> iso_js_result = timedelta_df.to_json(date_format='iso')
>>> iso_js_result
'{"datetime":{"0":"2013-12-31T00:00:00.000Z"},"days":{"0":"P1DT0H0M0S"}}'
>>> iso_results_df = pd.read_json(iso_js_result)
>>> iso_results_df
                   datetime        days
0 2013-12-31 00:00:00+00:00  P1DT0H0M0S
>>> iso_results_df = iso_results_df.astype(timedelta_df.dtypes.to_dict())
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "F:\temp\virtualEnvironments\inference_schema_py37_dev\lib\site-packages\pandas\core\generic.py", line 5862, in astype
    col.astype(dtype=dtype[col_name], copy=copy, errors=errors)
  File "F:\temp\virtualEnvironments\inference_schema_py37_dev\lib\site-packages\pandas\core\generic.py", line 5877, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "F:\temp\virtualEnvironments\inference_schema_py37_dev\lib\site-packages\pandas\core\internals\managers.py", line 631, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "F:\temp\virtualEnvironments\inference_schema_py37_dev\lib\site-packages\pandas\core\internals\managers.py", line 427, in apply
    applied = getattr(b, f)(**kwargs)
  File "F:\temp\virtualEnvironments\inference_schema_py37_dev\lib\site-packages\pandas\core\internals\blocks.py", line 673, in astype
    values = astype_nansafe(vals1d, dtype, copy=True)
  File "F:\temp\virtualEnvironments\inference_schema_py37_dev\lib\site-packages\pandas\core\dtypes\cast.py", line 1074, in astype_nansafe
    return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
  File "pandas\_libs\lib.pyx", line 619, in pandas._libs.lib.astype_intsafe
ValueError: Could not convert object to NumPy timedelta

At this point I feel like I'm missing something. I've mostly been going off of the API reference docs for to_json, read_json, and astype, and nothing I've tried in terms of parameters there have solved this for me. I also tried using to_timedelta on the specific column (not ideal because I'll need to dynamically figure out which columns to run this on in the real application), but I get the same incorrect value there.

Any help/pointers on what I should be doing here, if there is a proper approach, would be much appreciated. Thanks.


Solution

  • The problem with result_df.astype(timedelta_df.dtypes.to_dict()) resulting in wrong values is that the datatype of the days column is timedelta64[ns], i.e. it expects nanoseconds, whereas to_json defaults to serializing timedeltas as milliseconds.

    One simple way to fix this would therefore be to explicitly serialize it as nanoseconds: timedelta_df.to_json(date_unit="ns").

    >>> result_df = pd.read_json(timedelta_df.to_json(date_unit="ns"))
    >>> result_df.astype(timedelta_df.dtypes)
        datetime   days
    0 2013-12-31 1 days
    

    Another way would be to tell pd.to_timedelta which units to expect:

    >>> result_df = pd.read_json(timedelta_df.to_json())
    >>> pd.to_timedelta(result_df.days, unit="ms")
    0   1 days
    Name: days, dtype: timedelta64[ns]
    

    Or for the iso format:

    >>> result_df = pd.read_json(timedelta_df.to_json(date_format='iso')
    >>> pd.to_timedelta(result_df.days)
    0   1 days
    Name: days, dtype: timedelta64[ns]