A Darts TimeSeries object requires an index. I am trying to index the 'Cases' and 'Rain' fields of my dataset.
I tried time-based indexing as follows. (Note: train_test_split is from the sklearn.model_selection module of the scikit-learn package.)
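import pandas as pd
from sklearn.model_selection import train_test_split
from darts import TimeSeries

csv_file = 'data.csv'
df = pd.read_csv(csv_file)
# Assuming the data has two columns: 'Year' and 'Week'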
df['Year_Week'] = df['Year'].astype(str) + '-' + df['Week'].astype(str)
# Convert the 'Year_Week' column to a datetime format with Sunday as the start of the week
df['Year_Week'] = pd.to_datetime(df['Year_Week'] + '-0', format='%Y-%U-%w')
# Set 'Year_Week' as the index of the DataFrame
df.set_index('Year_Week', inplace=True)
# Now, DataFrame has a datetime index based on the 'Year' and 'Week' columns
# Split data into train and test sets
train_size = 0.8
test_size = 0.2
train_data, test_data = train_test_split(df, train_size=train_size, test_size=test_size)
# Create Darts TimeSeries objects for training and testing data
train_series = TimeSeries.from_dataframe(train_data[['Cases']], freq='W')
test_series = TimeSeries.from_dataframe(test_data[['Cases']], freq='W')
train_covariate_series = TimeSeries.from_dataframe(train_data[['Rain']], freq='W')
test_covariate_series = TimeSeries.from_dataframe(test_data[['Rain']], freq='W')
However, when I run print(len(test_series)) or print(len(train_series)), it returns the total number of rows in the original df.
I have also tried normal integer indexing but can't quite figure out how to make it work.
# Split data into train and test sets
train_size = 0.8
test_size = 0.2
random_state = 42
train_data, test_data = train_test_split(df, train_size=train_size, test_size=test_size, random_state=random_state)
# Create Darts TimeSeries objects with a simple integer index
train_series = TimeSeries.from_dataframe(train_data, value_cols=['Cases'])
test_series = TimeSeries.from_dataframe(test_data, value_cols=['Cases'])
train_covariate_series = TimeSeries.from_dataframe(train_data, value_cols=['Rain'])
test_covariate_series = TimeSeries.from_dataframe(test_data, value_cols=['Rain'])
which resulted in
raise ValueError(message)
ValueError: Could not convert integer index to a pd.RangeIndex. Found non-unique step sizes/frequencies: {1, 2, 3, 4, 5, 6}. If any of those is the actual frequency, try passing it with fill_missing_dates=True and freq=your_frequency
along with some other errors.
How do I correctly index and split the dataframe?
Here's what you need:
train_data, test_data = train_test_split(df, train_size=train_size, test_size=test_size, shuffle=False)
The train_test_split function from sklearn.model_selection shuffles the data by default before splitting. You shouldn't do that when working with time series, because the order of the observations matters.
You got that wrong result because, when you created the Darts TimeSeries with freq='W', it added rows with NaN values for every week between your samples that did not have an explicit value.
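To make the effect concrete, here is a minimal sketch that reproduces it with pandas alone. It does not call Darts or scikit-learn: df.sample stands in for the default shuffling of train_test_split, an iloc cut stands in for shuffle=False, and asfreq('W') stands in for the gap filling described above. The data is synthetic and the helper weekly_length is a hypothetical name used only for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data: 100 consecutive weekly rows.
# Column names follow the question; the values are made up.
idx = pd.date_range("2020-01-05", periods=100, freq="W")  # weekly, on Sundays
df = pd.DataFrame(
    {"Cases": np.arange(100), "Rain": np.linspace(0.0, 50.0, 100)},
    index=idx,
)

# Shuffled 80% sample (assumption: stands in for the default shuffle=True).
shuffled_train = df.sample(frac=0.8, random_state=42)

# Chronological 80/20 cut (assumption: stands in for shuffle=False).
cut = int(len(df) * 0.8)
train_data, test_data = df.iloc[:cut], df.iloc[cut:]


def weekly_length(frame):
    """Row count after reindexing onto a gap-free weekly grid.

    Weeks missing from `frame` come back as NaN rows, which is how a
    series can end up longer than the data it was built from.
    """
    return len(frame.sort_index().asfreq("W"))


print("shuffled train:", len(shuffled_train), "->", weekly_length(shuffled_train))
print("ordered train: ", len(train_data), "->", weekly_length(train_data))
print("ordered test:  ", len(test_data), "->", weekly_length(test_data))
```

The shuffled sample is scattered across almost the whole date range, so filling the missing weeks stretches it back toward the length of the original df. The chronological pieces are already contiguous, so their lengths stay the same and the training set ends before the test set begins.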