Tags: python, pandas, data-science, alpha-vantage

Alpha_Vantage API returning incorrect time series data


I am downloading time series data for the EUR/USD exchange rate into a Python pandas DataFrame using the alpha_vantage API. I am doing this to practice with pandas and scikit-learn, attempting to fit models to the data after joining additional columns of technical indicators. I successfully built a large DataFrame of prices and technical indicators, but was surprised to find that the open, high, low, and close prices are all equal in every row. I know this cannot be accurate. Is this a known problem with the alpha_vantage API?

#timeseries class from the alpha_vantage module
from alpha_vantage import timeseries

ts = timeseries.TimeSeries(key='(My Key)', output_format='pandas')
#price pandas DataFrame: get_daily returns a (data, metadata) tuple
price_df = ts.get_daily(symbol='EURUSD', outputsize='full')[0]
#show the DataFrame
price_df
            1. open  2. high  3. low  4. close  5. volume
date                                                     
1998-01-02   1.0866   1.0866  1.0866    1.0866        0.0
1998-01-05   1.0776   1.0776  1.0776    1.0776        0.0
1998-01-06   1.0754   1.0754  1.0754    1.0754        0.0
1998-01-07   1.0733   1.0733  1.0733    1.0733        0.0
1998-01-08   1.0784   1.0784  1.0784    1.0784        0.0
1998-01-09   1.0764   1.0764  1.0764    1.0764        0.0
1998-01-12   1.0769   1.0769  1.0769    1.0769        0.0
1998-01-13   1.0755   1.0755  1.0755    1.0755        0.0
1998-01-14   1.0749   1.0749  1.0749    1.0749        0.0
1998-01-15   1.0699   1.0699  1.0699    1.0699        0.0
1998-01-16   1.0719   1.0719  1.0719    1.0719        0.0
1998-01-19   1.0669   1.0669  1.0669    1.0669        0.0
1998-01-20   1.0646   1.0646  1.0646    1.0646        0.0
1998-01-21   1.0722   1.0722  1.0722    1.0722        0.0
1998-01-22   1.0868   1.0868  1.0868    1.0868        0.0
1998-01-23   1.1002   1.1002  1.1002    1.1002        0.0
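
To confirm the problem is not limited to the rows shown, a quick check (using the column names exactly as returned above) counts how many rows have all four prices collapsed to a single value:

#select the four price columns and count distinct values per row
ohlc = price_df[['1. open', '2. high', '3. low', '4. close']]
flat = ohlc.nunique(axis=1) == 1
print(f"{flat.sum()} of {len(price_df)} rows have identical OHLC values")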

Solution

  • This is a "bad tick". Read the white paper from TickData: https://s3-us-west-2.amazonaws.com/tick-data-s3/pdf/Tick_Data_Filtering_White_Paper.pdf

    The problem boils down to the human/machine interface: data errors originate at the points where humans come into contact with the mechanical process.

    Page 4 of the paper outlines a case where three consecutive ticks print at exactly 55:

    "Data points represent 3 bad ticks in succession. Interestingly, the bad ticks lie at the value 55.00. Most likely, these ticks appear as bad because the fractional portion of the price was “lost.”"

    In your example above, something similar happened: an error, likely of human origin, made one of the O/H/L/C values the value for all four columns. I believe that, given the size of the data stored, something went wrong when the file was compressed for storage on disk.

    Compression utilities try to save space by finding values that look similar, assigning them a common reference, and reconstructing the original value on demand. But if the utility is not instructed to preserve minute fractional differences between numeric values, it may treat every value across a row as the same. When the file is later extracted, it is read from a compressed copy that has corrupted the valuable data.

    I don't know for certain that is what happened in your example, but it wouldn't surprise me if something similar occurred.

    In any case, this is a great example of how difficult it can be to get reliable data. A rough sketch of the kind of tick filter the white paper discusses follows below.
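
    As a rough illustration (not the paper's exact algorithm; the window and threshold below are assumptions for the sketch), a rolling-median filter can flag ticks that deviate abnormally from their neighbours:

    import pandas as pd

    def flag_bad_ticks(close: pd.Series, window: int = 21, threshold: float = 5.0) -> pd.Series:
        # Rolling median of the price and absolute deviation from it
        med = close.rolling(window, center=True, min_periods=1).median()
        abs_dev = (close - med).abs()
        # Rolling median absolute deviation (MAD) as a robust scale estimate
        mad = abs_dev.rolling(window, center=True, min_periods=1).median()
        # Flag points more than `threshold` MADs from the local median;
        # fall back to the mean deviation where a window is perfectly flat
        return abs_dev > threshold * mad.replace(0.0, abs_dev.mean())

    bad = flag_bad_ticks(price_df['4. close'])
    print(price_df.loc[bad])

    Note that a filter like this catches outlier spikes (such as the 55.00 ticks in the paper); it would not catch the flat O/H/L/C rows above, which the equality check earlier already detects.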