I am using a public dataset (Coca-cola stock prices) from Kaggle: https://www.kaggle.com/datasets/kalilurrahman/coca-cola-stock-live-and-updated/data
I noticed that the dataset's frequency is daily, but some days are missing. In my copy (the dataset is updated frequently), there were 15311 entries, and when I resampled the dataframe daily (with the date as the index), there were 6900+ missing values, which also implies 6900+ missing days.
This is how I resampled the data to have a daily frequency:
import pandas as pd
data = pd.read_csv(filepath)
data_preprocessed = data.copy()
# Keep only the date part (drop the time component) and parse it
data_preprocessed['Date'] = pd.to_datetime(data['Date'].str.split(' ').str[0])
data_preprocessed.set_index('Date', inplace=True)
data_monthly = data_preprocessed.resample('D').last()  # resample to daily frequency
print(data_monthly.isnull().sum())  # Display how many missing days there are
I am conflicted on how I should approach these NaN values. Should I use ffill(), bfill(), interpolate(), or dropna() them all completely?
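For context, here is a toy series showing what each of those options would produce (a sketch with made-up values, not the real data):

```python
import pandas as pd

# Hypothetical daily series with a two-day gap
s = pd.Series([1.0, None, None, 4.0],
              index=pd.date_range('2022-01-01', periods=4, freq='D'))

print(s.ffill().tolist())        # forward fill: [1.0, 1.0, 1.0, 4.0]
print(s.bfill().tolist())        # backward fill: [1.0, 4.0, 4.0, 4.0]
print(s.interpolate().tolist())  # linear interpolation: [1.0, 2.0, 3.0, 4.0]
print(s.dropna().tolist())       # drop missing rows: [1.0, 4.0]
```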
Any responses will be highly appreciated. Thank you very much.
Yes, so the dataset has 15311 rows, and once resampled it has 22213 rows (22213 − 15311 = 6902 missing entries, matching the NaN count).
data_monthly[data_monthly['Open'].isna()]
output
Open High Low Close Volume Dividends Stock Splits
Date
1962-01-06 NaN NaN NaN NaN NaN NaN NaN
1962-01-07 NaN NaN NaN NaN NaN NaN NaN
1962-01-13 NaN NaN NaN NaN NaN NaN NaN
1962-01-14 NaN NaN NaN NaN NaN NaN NaN
1962-01-20 NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ...
2022-10-09 NaN NaN NaN NaN NaN NaN NaN
2022-10-15 NaN NaN NaN NaN NaN NaN NaN
2022-10-16 NaN NaN NaN NaN NaN NaN NaN
2022-10-22 NaN NaN NaN NaN NaN NaN NaN
2022-10-23 NaN NaN NaN NaN NaN NaN NaN
6902 rows × 7 columns
If you then look at how many days are missing from each year, it is about the same number every year:
data_monthly[data_monthly['Open'].isna()].index.year.value_counts().plot.bar(figsize=(10,6))
What it looks like is weekends and holidays: the stock does not trade on those days. So maybe do not resample at all.
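A quick way to confirm this pattern on a toy business-day series (hypothetical dates, not the Kaggle data): resampling trading days to a daily frequency introduces NaNs only on Saturdays and Sundays.

```python
import pandas as pd

# Two trading weeks of business days (Mon-Fri only)
idx = pd.bdate_range('2022-10-03', '2022-10-14')
prices = pd.Series(range(len(idx)), index=idx, dtype=float)

# Resampling to 'D' inserts the weekend dates as NaN rows
daily = prices.resample('D').last()
missing = daily[daily.isna()].index

print(sorted(missing.dayofweek.unique()))  # [5, 6] -> Saturday and Sunday
```

If the gaps really are just non-trading days, dropping them (or never resampling to 'D' in the first place) is usually the sensible choice for stock data.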