Search code examples
pythonpython-3.xpandasjupyter-notebookdata-analysis

reindexing data to DateTimeIndex results in all price data becoming NaN


gold = pd.read_csv('gold.csv')


gold['Date'] = gold['Date'].apply(pd.to_datetime, utc=True)


gold
                         Date   Price
0   1950-01-01 00:00:00+00:00   34.730
1   1950-02-01 00:00:00+00:00   34.730
2   1950-03-01 00:00:00+00:00   34.730
3   1950-04-01 00:00:00+00:00   34.730
4   1950-05-01 00:00:00+00:00   34.730
... ...                         ...
842 2020-03-01 00:00:00+00:00   1593.764
843 2020-04-01 00:00:00+00:00   1680.030
844 2020-05-01 00:00:00+00:00   1715.697
845 2020-06-01 00:00:00+00:00   1734.032
846 2020-07-01 00:00:00+00:00   1840.807


gold.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847 entries, 0 to 846
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype              
---  ------  --------------  -----              
 0   Date    847 non-null    datetime64[ns, UTC]
 1   Price   847 non-null    float64            
dtypes: datetime64[ns, UTC](1), float64(1)
memory usage: 13.4 KB


gold = gold.set_index('Date')


gold.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 847 entries, 1950-01-01 00:00:00+00:00 to 2020-07-01 00:00:00+00:00
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Price   847 non-null    float64
dtypes: float64(1)
memory usage: 13.2 KB


gold        
Date                        Price
1950-01-01 00:00:00+00:00   34.730
1950-02-01 00:00:00+00:00   34.730
1950-03-01 00:00:00+00:00   34.730
1950-04-01 00:00:00+00:00   34.730
1950-05-01 00:00:00+00:00   34.730
...                         ...
2020-03-01 00:00:00+00:00   1593.764
2020-04-01 00:00:00+00:00   1680.030
2020-05-01 00:00:00+00:00   1715.697
2020-06-01 00:00:00+00:00   1734.032
2020-07-01 00:00:00+00:00   1840.807


years = pd.date_range('01/01/1950', '01/01/2020', freq='YS')


years
DatetimeIndex(['1950-01-01', '1951-01-01', '1952-01-01', '1953-01-01',
               '1954-01-01', '1955-01-01', '1956-01-01', '1957-01-01',
               '1958-01-01', '1959-01-01', '1960-01-01', '1961-01-01',
               '1962-01-01', '1963-01-01', '1964-01-01', '1965-01-01',
               '1966-01-01', '1967-01-01', '1968-01-01', '1969-01-01',
               '1970-01-01', '1971-01-01', '1972-01-01', '1973-01-01',
               '1974-01-01', '1975-01-01', '1976-01-01', '1977-01-01',
               '1978-01-01', '1979-01-01', '1980-01-01', '1981-01-01',
               '1982-01-01', '1983-01-01', '1984-01-01', '1985-01-01',
               '1986-01-01', '1987-01-01', '1988-01-01', '1989-01-01',
               '1990-01-01', '1991-01-01', '1992-01-01', '1993-01-01',
               '1994-01-01', '1995-01-01', '1996-01-01', '1997-01-01',
               '1998-01-01', '1999-01-01', '2000-01-01', '2001-01-01',
               '2002-01-01', '2003-01-01', '2004-01-01', '2005-01-01',
               '2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01',
               '2010-01-01', '2011-01-01', '2012-01-01', '2013-01-01',
               '2014-01-01', '2015-01-01', '2016-01-01', '2017-01-01',
               '2018-01-01', '2019-01-01', '2020-01-01'],
              dtype='datetime64[ns]', freq='AS-JAN')


reindexed = gold.reindex(years)


reindexed
            Price
1950-01-01  NaN
1951-01-01  NaN
1952-01-01  NaN
1953-01-01  NaN
1954-01-01  NaN
...         ...
2016-01-01  NaN
2017-01-01  NaN
2018-01-01  NaN
2019-01-01  NaN
2020-01-01  NaN

When I plot reindexed there are no values for the price after reindexing. I am trying to get gold reindexed by the first day of each year from 1950 to 2020, and then plot them on a pandas generated line graph. I have looked at various posts from Stack Overflow with similar issues, but I am still not sure what I am doing wrong.


Solution

  • You need to use an UTC localized index, convert with tz_localize:

    reindexed = gold.reindex(years.tz_localize('utc'))
    

    Output:

                               Price
    1950-01-01 00:00:00+00:00  34.73
    1951-01-01 00:00:00+00:00    NaN  # NaN as the data is
    1952-01-01 00:00:00+00:00    NaN  # not in your example
    1953-01-01 00:00:00+00:00    NaN
    1954-01-01 00:00:00+00:00    NaN
    ...                          ...
    2016-01-01 00:00:00+00:00    NaN
    2017-01-01 00:00:00+00:00    NaN
    2018-01-01 00:00:00+00:00    NaN
    2019-01-01 00:00:00+00:00    NaN
    2020-01-01 00:00:00+00:00    NaN
    
    [71 rows x 1 columns]
    

    You might also try to resample getting the first value per year:

    resampled = gold.resample('YS').first()
    

    Output:

                                  Price
    Date                               
    1950-01-01 00:00:00+00:00    34.730
    1951-01-01 00:00:00+00:00       NaN
    1952-01-01 00:00:00+00:00       NaN
    1953-01-01 00:00:00+00:00       NaN
    1954-01-01 00:00:00+00:00       NaN
    ...                             ...
    2016-01-01 00:00:00+00:00       NaN
    2017-01-01 00:00:00+00:00       NaN
    2018-01-01 00:00:00+00:00       NaN
    2019-01-01 00:00:00+00:00       NaN
    2020-01-01 00:00:00+00:00  1593.764
    
    [71 rows x 1 columns]