Tags: pandas, datetime, google-cloud-datalab

Why does pandas DataFrame.append() give an error with timezone values?


I have a data frame that is being appended to in a loop (if there's a better way to iteratively add rows to the end of a data frame, suggestions are welcome). The following snippet of code gives an error:

import pandas as pd
import pytz
import datetime

x = 'astring'
t = (datetime.datetime(2018, 5, 31, 13, 15, 17, tzinfo=pytz.utc), datetime.datetime(2100, 5, 31, tzinfo=pytz.utc))
df = pd.DataFrame(columns=['a', 'b', 'c'])
df = df.append({'a': x, 'b': t[0], 'c': t[1]}, ignore_index=True)

TypeError                                 Traceback (most recent call last)
<ipython-input-161-0df455a78607> in <module>()
      2 t = (datetime.datetime(2018, 5, 31, 13, 15, 17, tzinfo=pytz.utc), datetime.datetime(2100, 5, 31, tzinfo=pytz.utc))
      3 df = pd.DataFrame(columns=['a', 'b', 'c'])
----> 4 df = df.append({'a': x, 'b': t[0], 'c': t[1]}, ignore_index=True)

/usr/local/envs/py3env/lib/python3.5/site-packages/pandas/core/frame.py in append(self, other, ignore_index, verify_integrity)
   5192 
   5193     _shared_docs['pivot_table'] = """
-> 5194         Create a spreadsheet-style pivot table as a DataFrame. The levels in
   5195         the pivot table will be stored in MultiIndex objects (hierarchical
   5196         indexes) on the index and columns of the result DataFrame

/usr/local/envs/py3env/lib/python3.5/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    211     a  1
    212     >>> df6 = pd.DataFrame([2], index=['a'])
--> 213     >>> df6
    214        0
    215     a  2

/usr/local/envs/py3env/lib/python3.5/site-packages/pandas/core/reshape/concat.py in get_result(self)
    406             mgrs_indexers = []
    407             for obj in self.objs:
--> 408                 mgr = obj._data
    409                 indexers = {}
    410                 for ax, new_labels in enumerate(self.new_axes):

/usr/local/envs/py3env/lib/python3.5/site-packages/pandas/core/internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   5201     expanded label indexer
   5202     """
-> 5203     mult = np.array(shape)[::-1].cumprod()[::-1]
   5204     return _ensure_platform_int(
   5205         np.sum(np.array(labels).T * np.append(mult, [1]), axis=1).T)

/usr/local/envs/py3env/lib/python3.5/site-packages/pandas/core/internals.py in concatenate_join_units(join_units, concat_axis, copy)
   5330 
   5331     # see if we are only masking values that if putted
-> 5332     # will work in the current dtype
   5333     try:
   5334         nn = n[m]

/usr/local/envs/py3env/lib/python3.5/site-packages/pandas/core/internals.py in <listcomp>(.0)
   5330 
   5331     # see if we are only masking values that if putted
-> 5332     # will work in the current dtype
   5333     try:
   5334         nn = n[m]

/usr/local/envs/py3env/lib/python3.5/site-packages/pandas/core/internals.py in get_reindexed_values(self, empty_dtype, upcasted_na)
   5601     for ax, indexer in indexers.items():
   5602         mgr_shape[ax] = len(indexer)
-> 5603     mgr_shape = tuple(mgr_shape)
   5604 
   5605     if 0 in indexers:

TypeError: data type not understood

However, the following snippet works fine:

x = 'astring'
t = (datetime.datetime(2018, 5, 31, 13, 15, 17), datetime.datetime(2100, 5, 31))
df = pd.DataFrame(columns=['a', 'b', 'c'])
df = df.append({'a': x, 'b': t[0], 'c': t[1]}, ignore_index=True)

Stranger still, this also works:

t = (datetime.datetime(2018, 5, 31, 13, 15, 17, tzinfo=pytz.utc), datetime.datetime(2100, 5, 31, tzinfo=pytz.utc))
df = pd.DataFrame(columns=['b', 'c'])
df = df.append({'b': t[0], 'c': t[1]}, ignore_index=True)

What am I missing?

pandas==0.23.0
pytz==2016.7

Solution

  • This looks like a compatibility issue between versions of the pandas and pytz libraries.

    I reproduced your error in Datalab and fixed it by upgrading to pandas==0.23.0 (a brand-new Datalab instance ships with 0.22.0 by default) and pytz==2018.4. Other Stack Overflow posts suggest that numpy can also cause problems here, so for reference, I am using numpy==1.14.3.

    To upgrade the libraries:

    1. Create a new notebook and run the command !pip install --upgrade pandas in the first cell. This also installed pytz==2018.4 for me; if it does not in your case, try installing it manually.
    2. Restart the kernel by clicking on the "Reset session" option in Datalab.
    3. Run your code again and see if it works now.
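
Step 2 matters because a running Python process keeps using the copy of a module it has already imported, even after pip replaces the files on disk. Here is a minimal sketch of that behaviour, assuming a made-up module named `fake_lib` that stands in for pandas; the helper names and version strings are hypothetical and chosen only for illustration, and a subprocess plays the role of the restarted kernel.

```python
import importlib
import os
import subprocess
import sys
import tempfile


def write_fake_lib(directory, version):
    """Write fake_lib.py (a hypothetical stand-in for a real package)."""
    with open(os.path.join(directory, 'fake_lib.py'), 'w') as handle:
        handle.write("__version__ = '{}'\n".format(version))


def demo_stale_import():
    """Return the version seen before the upgrade, after it, and after a restart."""
    old_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # keep cached bytecode out of the picture
    with tempfile.TemporaryDirectory() as tmp:
        sys.path.insert(0, tmp)
        try:
            write_fake_lib(tmp, '0.22.0')
            importlib.invalidate_caches()
            before = importlib.import_module('fake_lib').__version__

            # "Upgrade" the file on disk while this interpreter keeps running.
            write_fake_lib(tmp, '0.23.0')
            # The second import returns the module already cached in sys.modules.
            after = importlib.import_module('fake_lib').__version__

            # A fresh interpreter plays the role of the restarted kernel.
            restarted = subprocess.check_output(
                [sys.executable, '-c',
                 'import fake_lib; print(fake_lib.__version__)'],
                env=dict(os.environ, PYTHONPATH=tmp),
                universal_newlines=True).strip()
        finally:
            sys.path.remove(tmp)
            sys.modules.pop('fake_lib', None)
            sys.dont_write_bytecode = old_flag
    return before, after, restarted


print(demo_stale_import())
```

The running interpreter reports the old version both before and after the file changes; only the fresh process picks up the new one, which is why the session has to be reset after the upgrade.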

    To confirm which versions are actually in use, add the following lines:

    import numpy as np  # pd and pytz are already imported in your code

    print(pd.__version__)
    print(pytz.__version__)
    print(np.__version__)
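
On the side question about adding rows in a loop: one common alternative is to collect the rows in a plain Python list and build the DataFrame once at the end. It is also worth knowing that DataFrame.append was removed in pandas 2.0, so this pattern keeps working on newer releases. The following is only a sketch under assumptions: the loop and its values are invented for illustration, and it uses the standard library's datetime.timezone.utc rather than pytz.utc.

```python
import datetime

import pandas as pd

# Assumption: stdlib UTC is used here instead of pytz.utc from the question.
utc = datetime.timezone.utc

rows = []  # a plain list is cheap to append to inside a loop
for i in range(3):
    rows.append({
        'a': 'astring',
        'b': datetime.datetime(2018, 5, 31, 13, 15, 17, tzinfo=utc)
             + datetime.timedelta(days=i),
        'c': datetime.datetime(2100, 5, 31, tzinfo=utc),
    })

# Build the frame once, after the loop, instead of growing it row by row.
df = pd.DataFrame(rows, columns=['a', 'b', 'c'])
print(df.dtypes)
```

Building the frame in one step avoids copying the whole DataFrame on every iteration, and it does not go through the empty-frame concatenation that raised the error above.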