Search code examples
pythonpandasvalueerror

reindex from a duplicate axis


I have the following code:

import pandas as pd
from pandas import datetime
from pandas import DataFrame as df
import matplotlib
from pandas_datareader import data as web
import matplotlib.pyplot as plt
import datetime

TOKEN = "d0d2a3295349c625be6c0cbe23f9136221eb45ef"
con = fxcmpy.fxcmpy(access_token=TOKEN, log_level='error')
symbols = con.get_instruments()

start = datetime.datetime(2015,1,1)
end = datetime.datetime.today()
data = con.get_candles('NGAS', period='D1', start = start, end = end)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d')
data = data.set_index(data.index.normalize())
full_dates = pd.date_range(start, end)
data = data.reindex(full_dates)

The last line data = data.reindex(full_dates) gives me the following error:

ValueError: cannot reindex from a duplicate axis

What i am trying to do is fill the missing dates and reindex the column.


As mentioned by @jezrael "problem is duplicated values in DatetimeIndex, so reindex cannot be used here"

I have used the same code earlier and it worked fine. Curious why it is not working in this case

import pandas as pd
from pandas import datetime
from pandas import DataFrame as df
import matplotlib
from pandas_datareader import data as web
import matplotlib.pyplot as plt
import datetime
import numpy as np

stock = 'F'
start = datetime.date(2008,1,1)
end = datetime.date.today()
data = web.DataReader(stock, 'yahoo',start, end)
data.index = pd.to_datetime(data.index, format ='%Y-%m-%d')

full_dates = pd.date_range(start, end)
data = data.reindex(full_dates)

The code is the same except the provider but this one works and the one above did not?


Solution

  • So problem is duplicated values in DatetimeIndex, so reindex cannot be used here.

    Possible solution is use DataFrame.join with helper DataFrame by all values:

    data = data.set_index(data.index.normalize())
    full_dates = pd.date_range(start, end)
    df = pd.DataFrame({'date':full_dates}).join(data, on='date')
    print (df)
               date  bidopen  bidclose  bidhigh  bidlow  askopen  askclose  \
    0    2015-01-01      NaN       NaN      NaN     NaN      NaN       NaN   
    1    2015-01-02   2.9350     2.947   3.0910   2.860   2.9450     2.957   
    2    2015-01-03      NaN       NaN      NaN     NaN      NaN       NaN   
    3    2015-01-04      NaN       NaN      NaN     NaN      NaN       NaN   
    4    2015-01-05   2.9470     2.912   3.1710   2.871   2.9570     2.922   
            ...      ...       ...      ...     ...      ...       ...   
    1797 2019-12-03   2.3890     2.441   2.5115   2.371   2.3970     2.449   
    1798 2019-12-04   2.3455     2.392   2.3970   2.341   2.3535     2.400   
    1798 2019-12-04   2.4410     2.406   2.4645   2.370   2.4490     2.414   
    1799 2019-12-05   2.4060     2.421   2.4650   2.399   2.4140     2.429   
    1800 2019-12-06      NaN       NaN      NaN     NaN      NaN       NaN   
    
          askhigh  asklow  tickqty  
    0         NaN     NaN      NaN  
    1       3.101  2.8700  12688.0  
    2         NaN     NaN      NaN  
    3         NaN     NaN      NaN  
    4       3.181  2.8810  21849.0  
          ...     ...      ...  
    1797    2.519  2.3785  36679.0  
    1798    2.406  2.3505   5333.0  
    1798    2.473  2.3780  74881.0  
    1799    2.473  2.4070  29238.0  
    1800      NaN     NaN      NaN  
    
    [1802 rows x 10 columns]
    

    But I think next processing should be problematic (because duplicated index), so use DataFrame.resample by days with aggregation functions in dictionary:

    df = data.resample('D').agg({'bidopen': 'first', 
                                 'bidclose': 'last',
                                 'bidhigh': 'max', 
                                 'bidlow': 'min', 
                                 'askopen': 'first', 
                                 'askclose': 'last',
                                 'askhigh': 'max', 
                                 'asklow': 'min', 
                                 'tickqty':'sum'})
    

    print (df)
                bidopen  bidclose  bidhigh  bidlow  askopen  askclose  askhigh  \
    date                                                                         
    2015-01-02   2.9350    2.9470   3.0910   2.860   2.9450    2.9570    3.101   
    2015-01-03      NaN       NaN      NaN     NaN      NaN       NaN      NaN   
    2015-01-04      NaN       NaN      NaN     NaN      NaN       NaN      NaN   
    2015-01-05   2.9470    2.9120   3.1710   2.871   2.9570    2.9220    3.181   
    2015-01-06   2.9120    2.9400   2.9510   2.807   2.9220    2.9500    2.961   
                ...       ...      ...     ...      ...       ...      ...   
    2019-12-01      NaN       NaN      NaN     NaN      NaN       NaN      NaN   
    2019-12-02   2.3505    2.3455   2.3670   2.292   2.3590    2.3535    2.375   
    2019-12-03   2.3890    2.4410   2.5115   2.371   2.3970    2.4490    2.519   
    2019-12-04   2.3455    2.4060   2.4645   2.341   2.3535    2.4140    2.473   
    2019-12-05   2.4060    2.4210   2.4650   2.399   2.4140    2.4290    2.473   
    
                asklow  tickqty  
    date                         
    2015-01-02  2.8700    12688  
    2015-01-03     NaN        0  
    2015-01-04     NaN        0  
    2015-01-05  2.8810    21849  
    2015-01-06  2.8170    17955  
               ...      ...  
    2019-12-01     NaN        0  
    2019-12-02  2.3000    31173  
    2019-12-03  2.3785    36679  
    2019-12-04  2.3505    80214  
    2019-12-05  2.4070    29238  
    
    [1799 rows x 9 columns]