Search code examples
pythonpandasdatetimebinning

Pandas pd.cut on Timestamps - "ValueError: bins must increase monotonically"


I am trying to split time series data into labelled segments like this:

import pandas as pd
import numpy as np

# Create example DataFrame of stock values
df = pd.DataFrame({
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })

# Cut the date into sections 
today = df['date'].max()
bin_edges = [pd.Timestamp.min, today - pd.Timedelta('14 days'), today - pd.Timedelta('7 days'), pd.Timestamp.max]
df['Time Group'] = pd.cut(df['date'], bins=bin_edges, labels=['history', 'previous week', 'this week'])

But I am getting an error even though bin_edges does seem to be increasing monotonically..

Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-42-00524c0a883b>", line 13, in <module>
    df['Time Group'] = pd.cut(df['date'], bins=bin_edges, labels=['history', 'previous week', 'this week'])
  File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\tile.py", line 228, in cut
    raise ValueError('bins must increase monotonically.')
ValueError: bins must increase monotonically.


In[43]: bin_edges
Out[43]: 
[Timestamp('1677-09-21 00:12:43.145225'),
 Timestamp('2011-01-11 00:00:00'),
 Timestamp('2011-01-18 00:00:00'),
 Timestamp('2262-04-11 23:47:16.854775807')]

Why is this happening?


Solution

  • This is a bug in pandas. Your edges need to be converted to numeric values in order to perform the cut, and by using pd.Timestamp.min and pd.Timestamp.max you're essentially setting the edges at the lower/upper bounds of what can be represented by 64bit integers. This is causing an overflow when trying to compare the edges for monotonicity, which is making it look like it's not monotonic increasing.

    Demonstration of the overflow:

    In [2]: bin_edges_numeric = [t.value for t in bin_edges]
    
    In [3]: bin_edges_numeric
    Out[3]:
    [-9223372036854775000,
     1294704000000000000,
     1295308800000000000,
     9223372036854775807]
    
    In [4]: np.diff(bin_edges_numeric)
    Out[4]:
    array([-7928668036854776616,      604800000000000,  7928063236854775807],
          dtype=int64)
    

    Until this is fixed, my recommendation is to use a lower/upper that's closer to your actual dates but still achieves the same end result:

    first = df['date'].min()
    today = df['date'].max()
    bin_edges = [first - pd.Timedelta('1000 days'), today - pd.Timedelta('14 days'),
                 today - pd.Timedelta('7 days'), today + pd.Timedelta('1000 days')]
    

    I picked 1000 days arbitrarily, and you could choose a different value as you see fit. With these modifications the cut should be not raise an error.