Tags: python, machine-learning, train-test-split, standardized

Should I standardize and detrend before the train/test split?


I'm new to Python and trying to perform a random forest regression task. I import my dataset, which has 5 columns in total (including a date column). My data is time dependent, so I cannot use a random train/test split. Instead I do the following:

feature_cols = ['Rainfall', 'Temperature', 'Usage amount']
target_v = df['water level']
X = df[feature_cols]
y = target_v

Then I use TimeSeriesSplit from sklearn to split my data into train and test sets:

from sklearn.model_selection import TimeSeriesSplit
tss = TimeSeriesSplit(n_splits = 3)
for train_index, test_index in tss.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index,:]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

Now I need to perform preprocessing, such as scaling my data and removing the mean (detrending). So my question is: what am I supposed to do first? That is, do I remove the mean first and then scale my data, or do I first scale and then remove the mean?

Also, do I perform the two techniques on my entire dataframe (df) or on a subset of my data (i.e. just on the training data)? If it's a subset, how do I do this?

Here is an example of scaling and mean removal I had tried on my entire dataframe:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

mean = np.mean(df.values, axis=-1, keepdims=True)
detrended = df - mean

Then I used the detrended dataframe for splitting my data into train and test and ran my models. I'm not sure if this is the correct approach? Any help will be appreciated, thank you.


Solution

  • You almost always standardize after train/test splitting your data. When you later get real-world data to test your model on, you won't be able to go back and refit your scaler without messing with your model. Some would even consider including the test data in scaling a form of overfitting, since you're effectively cheating by letting the model account for the test data while scaling.

    So what you should do first is the train/test split. Then fit the scaler to the training data, transform the training data with it, and then transform the testing data using the same scaler without refitting. By doing this you ensure the same values are represented in the same way for all future data that could be fed into the model.
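    A minimal sketch of that order of operations, reusing the TimeSeriesSplit loop and the column names from the question (the data itself is made up for illustration):

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import TimeSeriesSplit
    from sklearn.preprocessing import StandardScaler

    # Toy time-ordered data with the columns from the question (values are made up)
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        'Rainfall': rng.normal(50, 10, 100),
        'Temperature': rng.normal(20, 5, 100),
        'Usage amount': rng.normal(300, 30, 100),
        'water level': rng.normal(5, 1, 100),
    })

    X = df[['Rainfall', 'Temperature', 'Usage amount']]
    y = df['water level']

    tss = TimeSeriesSplit(n_splits=3)
    for train_index, test_index in tss.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        scaler = StandardScaler()                  # fresh scaler per fold
        X_train_s = scaler.fit_transform(X_train)  # fit on training data only
        X_test_s = scaler.transform(X_test)        # reuse training statistics, no refit
    ```

    Note the scaler is created and fitted inside the loop, so each fold's test set is scaled using only the statistics of the training data that precedes it in time.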

    ———————

    Sklearn scalers have two methods: fit and transform. fit updates the parameters of the scaler (how it scales data) to match your data. transform applies the scaler to your data so that your data is scaled. Sklearn also provides a method that fits and transforms in one step (fit_transform).

    We want to tune our scaler to our training set and then scale that data, so we run fit_transform on it. For our test set, we don't want the scaler to adjust how it scales data; we just want it to scale the data the same way it scaled the training set. So this time we only run transform.
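    To make the fit/transform distinction concrete, here's a tiny sketch (the numbers are made up, not from the question) showing that transform reuses the statistics learned during fit:

    ```python
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    train = np.array([[1.0], [2.0], [3.0]])  # mean 2.0, population std sqrt(2/3)
    test = np.array([[4.0]])

    scaler = StandardScaler()
    scaler.fit(train)             # learn mean_ and scale_ from training data
    print(scaler.mean_)           # [2.]

    train_scaled = scaler.transform(train)  # uses the training mean and scale
    test_scaled = scaler.transform(test)    # same statistics, no refitting
    ```

    The test value 4.0 is scaled as (4.0 - 2.0) / sqrt(2/3): the scaler keeps the training set's mean and scale rather than recomputing them from the test data.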

    You can fit and transform both your features and your labels if you want, though scaling your labels is a lot less important: with a single target you don't run into the same issues with differences in the scale of values that you get between features.
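    If you do decide to scale the target as well, one common pattern (sketched here with made-up values) is to use a separate scaler for y, so that predictions made in scaled space can be mapped back to the original units with inverse_transform:

    ```python
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    y_train = np.array([[5.0], [6.0], [7.0]])  # targets reshaped to 2-D for the scaler

    y_scaler = StandardScaler()
    y_train_s = y_scaler.fit_transform(y_train)  # fit on training targets only

    # After predicting in scaled space, map predictions back to original units
    y_back = y_scaler.inverse_transform(y_train_s)
    ```

    Keeping a dedicated y_scaler avoids mixing the target's statistics with the feature scaler's, and inverse_transform gives you predictions in the original water-level units.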