Tags: python, pandas, time-series

How to correctly index and split a pandas DataFrame?


(Image: what my dataset looks like)

A Darts TimeSeries object requires an index. I am trying to index the fields 'Cases' and 'Rain'.

I tried time-based indexing with the code below. (Note: train_test_split is from the sklearn.model_selection module of the scikit-learn package.)

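import pandas as pd
from sklearn.model_selection import train_test_split
from darts import TimeSeries

csv_file = 'data.csv'
df = pd.read_csv(csv_file)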

# Assuming the data has two columns, 'Year' and 'Week'
df['Year_Week'] = df['Year'].astype(str) + '-' + df['Week'].astype(str)

# Convert the 'Year_Week' column to a datetime format with Sunday as the start of the week
df['Year_Week'] = pd.to_datetime(df['Year_Week'] + '-0', format='%Y-%U-%w')

# Set 'Year_Week' as the index of the DataFrame
df.set_index('Year_Week', inplace=True)

# Now, DataFrame has a datetime index based on the 'Year' and 'Week' columns

# Split data into train and test sets
train_size = 0.8
test_size = 0.2

train_data, test_data = train_test_split(df, train_size=train_size, test_size=test_size)

# Create Darts TimeSeries objects for training and testing data
train_series = TimeSeries.from_dataframe(train_data[['Cases']], freq='W')
test_series = TimeSeries.from_dataframe(test_data[['Cases']], freq='W')
train_covariate_series = TimeSeries.from_dataframe(train_data[['Rain']], freq='W')
test_covariate_series = TimeSeries.from_dataframe(test_data[['Rain']], freq='W')

However, when I run print(len(test_series)) or print(len(train_series)), it returns the total number of rows in the original df.

I have also tried normal integer indexing but can't quite figure out how to make it work.


# Split data into train and test sets
train_size = 0.8
test_size = 0.2
random_state = 42

train_data, test_data = train_test_split(df, train_size=train_size, test_size=test_size, random_state=random_state)

# Create Darts TimeSeries objects with a simple integer index
train_series = TimeSeries.from_dataframe(train_data, value_cols=['Cases'])
test_series = TimeSeries.from_dataframe(test_data, value_cols=['Cases'])
train_covariate_series = TimeSeries.from_dataframe(train_data, value_cols=['Rain'])
test_covariate_series = TimeSeries.from_dataframe(test_data, value_cols=['Rain'])

which resulted in

    raise ValueError(message)
ValueError: Could not convert integer index to a pd.RangeIndex. Found non-unique step sizes/frequencies: {1, 2, 3, 4, 5, 6}. If any of those is the actual frequency, try passing it with fill_missing_dates=True and freq=your_frequency

along with some other errors.

How do I correctly index and split the dataframe?


Solution

  • Here's what you need:

    train_data, test_data = train_test_split(df, train_size=train_size, test_size=test_size, shuffle=False)
    

    The train_test_split function from sklearn.model_selection shuffles the data by default before splitting. You shouldn't do that when working with time series, because the order of the observations matters.

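    You got that wrong result because when you created the Darts TimeSeries with freq='W', it added rows with NaN values for every week between your first and last samples that did not have an explicit value. Since a shuffled subset is scattered across nearly the whole date range, each series ended up about as long as the original df.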
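To make both points concrete, here is a minimal sketch that uses only pandas. The ten-row frame is a made-up stand-in for data.csv, the rows picked for scattered_test are a hypothetical example of what a shuffled split might select, and asfreq('W') is used as an assumed analogue of the gap filling described above, not as what Darts calls internally.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv: 10 consecutive weekly rows (Sundays).
index = pd.date_range("2023-01-01", periods=10, freq="W")
df = pd.DataFrame(
    {"Cases": np.arange(10), "Rain": np.arange(10) * 0.5},
    index=index,
)

# Chronological 80/20 split: the earliest 80% of rows go to train,
# the remaining rows go to test. No shuffling is involved.
split_point = int(len(df) * 0.8)
train_data = df.iloc[:split_point]
test_data = df.iloc[split_point:]

# For comparison: two rows that a shuffled split might have put in the
# test set (positions chosen by hand for illustration).
scattered_test = df.iloc[[1, 7]]

# Forcing a weekly frequency onto the scattered rows inserts a NaN row
# for every missing week between the first and the last sample.
filled_test = scattered_test.asfreq("W")

print(len(train_data), len(test_data))        # sizes of the ordered split
print(len(scattered_test), len(filled_test))  # before and after gap filling
```

With the ordered split, every training date comes before every test date, so train_data[['Cases']] and test_data[['Cases']] can be passed to TimeSeries.from_dataframe(..., freq='W') as in the question without any weeks having to be filled in.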