Search code examples
pythondataframetime-seriesdata-analysismissing-data

How to deal with consecutive missing values of stock price in a time series using python?


I have a data frame consisting of two-time series describing two different stock prices, spanning over five years with an interval of approximately 2 minutes. I am struggling to decide how to deal with the missing values to build a meaningful model.

Some info about the data frame:-

Total number rows: 1315440

Number of missing values in Series_1: 1113923

Number of missing values in Series_2: 378952

Often there are missing values in 100+ consecutive rows, which is what makes me confused about how to deal with this dataset.

Below is a portion of the data, plots of Series_1 (column 2) and Series_2 (column_3).

enter image description here

Visualisation of Series_1: enter image description here

Visualisation of Series_2: enter image description here

Any advice would be appreciated. Thanks.


Solution

  • Depending on where your data come from, the missing data at a given time may mean that at this particular timestamp, out of the two stocks, an order was executed for one but not for the other. There is no reason in fact that two different stocks trade at exactly the same time. Certain dormant stocks with no liquidity can go for a long time without being traded while others are more active. Moreover, given that the precision of the data is down to the microsecond, no surprise that the trades on both stocks are not necessarily happening at the exact same microsecond. In this cases, it is safe to assume that the price of the stock was the last recorded transaction and update the missing values accordingly. Assuming you are using pandas, you could harmonize it by applying the pandas fillna method. Just make sure to sort your data frame beforehand:

    df.sort_values('Time', inplace=True)
    df['Series1'].fillna(method='ffill', inplace=True)
    df['Series2'].fillna(method='ffill', inplace=True)