
How to find the number of rows and columns in a MinMaxScaler object?


I loaded a CSV file into a DataFrame, split it with train_test_split, and then used MinMaxScaler to scale the X training and test sets. Now I want to check the number of rows and columns, but I can't.

import pandas as pd
df = pd.read_csv("cancer_classification.csv")
from sklearn.model_selection import train_test_split
X = df.drop("benign_0__mal_1",axis=1).values
y = df["benign_0__mal_1"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit(X_train)
X_test = scaler.fit(X_test)
X_train.shape

This throws the following error:

AttributeError                            Traceback (most recent call last)
in ()
----> 1 X_train.shape

AttributeError: 'MinMaxScaler' object has no attribute 'shape'

I read the documentation and was able to find the number of rows using scale_, but not the number of columns. This is what the answer should look like, but I could not find an attribute that helps: [image: expected output]


Solution

  • MinMaxScaler is an object that can fit itself to data and also transform that data. There are three relevant methods:

    • The fit method fits the scaler's parameters to that data. It then returns the MinMaxScaler object.
    • The transform method transforms data based on the scaler's fitted parameters. It then returns the transformed data.
    • The fit_transform method first fits the scaler to that data, then transforms it and returns the transformed version of the data.

    In your example, you assign the return value of fit() to X_train, so X_train becomes the MinMaxScaler object itself instead of the data! (see 1st bullet point)
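
To make the difference in return values concrete, here is a minimal sketch. The array X is a made-up stand-in for the question's data (6 rows, 3 columns), not the real CSV.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the question's data: 6 rows, 3 columns.
X = np.arange(18, dtype=float).reshape(6, 3)

scaler = MinMaxScaler()

fitted = scaler.fit(X)           # returns the scaler itself, not data
X_scaled = scaler.transform(X)   # returns the scaled data as a NumPy array

print(fitted is scaler)          # True
print(type(X_scaled).__name__)   # ndarray
print(X_scaled.shape)            # (6, 3) -> rows, columns

# The fitted scaler also records what it saw during fit().
print(scaler.n_samples_seen_)    # 6 -> number of rows
print(scaler.scale_.shape)       # (3,) -> one entry per column
```

So the row and column counts come from the .shape of the transformed array. On the scaler itself, scale_ has one entry per column, and n_samples_seen_ holds the number of rows seen during fit().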

    The same MinMaxScaler shouldn't be fitted twice on different datasets, because the second fit overwrites the parameters learned from the first. You should also never fit a MinMaxScaler on the test dataset, since that leaks test data into your model. Instead, call fit_transform() on the training data and transform() on the test data.
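
A corrected version of the question's workflow could look roughly like this. Since cancer_classification.csv is not available here, the sketch uses random data as a placeholder (assumed 100 rows and 5 features); only the scaling steps are the point.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic placeholder for the CSV data (assumption: 100 rows, 5 features).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)  # learn min/max from the training data, then scale it
X_test = scaler.transform(X_test)        # reuse the training min/max, no refit

print(X_train.shape)  # (67, 5)
print(X_test.shape)   # (33, 5)
```

Because X_train and X_test are now arrays rather than scaler objects, .shape works as expected.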

    This answer may also help with the explanation: fit-transform on training data and transform on test data

    When you call StandardScaler.fit(X_train), it calculates the mean and variance of each feature in X_train. Calling .transform() then transforms each feature by subtracting the mean and dividing by the standard deviation (the square root of the variance). For convenience, these two calls can be done in one step using fit_transform().
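
As a small check of that arithmetic, the sketch below compares StandardScaler's output with a manual computation. The toy array is made up for illustration. Note that the division is by the standard deviation, i.e. the square root of the fitted variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data (hypothetical): 4 rows, 2 features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

scaler = StandardScaler().fit(X_train)

# mean_ and var_ are learned per feature during fit().
manual = (X_train - scaler.mean_) / np.sqrt(scaler.var_)

print(np.allclose(scaler.transform(X_train), manual))                # True
print(np.allclose(StandardScaler().fit_transform(X_train), manual))  # True
```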

    You want to fit the scaler using only the training data because you don't want to bias your model with information from the test data.

    If you fit() to your test data, you'd compute a new mean and variance for each feature. In theory these values may be very similar if your test and train sets have the same distribution, but in practice this is typically not the case.

    Instead, you want to only transform the test data by using the parameters computed on the training data.
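
The difference can be seen on a tiny made-up example with MinMaxScaler: the same test values scale differently depending on which data the scaler was fitted to.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[2.0], [12.0]])

# Recommended: the parameters come from the training data only.
scaler = MinMaxScaler().fit(X_train)
test_scaled = scaler.transform(X_test)
print(test_scaled.ravel())   # about [0.2, 1.2]: 12 is above the training max, so it maps above 1

# Not recommended: refitting on the test set changes the scale.
leaky = MinMaxScaler().fit(X_test)
test_refit = leaky.transform(X_test)
print(test_refit.ravel())    # about [0, 1]: the test set now defines its own range
```

With the training parameters, a test value outside the training range simply lands outside [0, 1], which is expected and keeps the test data on the same scale the model was trained on.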