Tags: python, machine-learning, scikit-learn, normalization, pca

Why do we need to pass the transpose of a data into sklearn StandardScaler()?


I have a set of data in which each row represents a gene and each column represents a sample.

I want to normalize it first and then perform PCA on it. I searched online and found that we need to pass the transpose of the data frame into sklearn.preprocessing.StandardScaler().

Here is my code now:

import sklearn.preprocessing
import sklearn.decomposition

# Transpose so samples become rows, then standardize; the result keeps samples in rows.
scale_df = sklearn.preprocessing.StandardScaler().fit_transform(df.iloc[:, 1:col].T)

pca = sklearn.decomposition.PCA()
pca_data = pca.fit_transform(scale_df.T)

Here is the part I am not sure about. First, why do we need to pass the transpose of the data into StandardScaler()? Second, after we pass the transpose in, the scaled array we get back is still transposed (samples in rows). Will this affect the result of the PCA? Should we transpose it back to normal before passing it to PCA?


Solution

  • The scikit-learn library follows the convention that rows represent units of observation (person, product, country, etc.) and columns represent their characteristics (height, weight, income, etc.). Since your data has one sample per column (and the sample is the unit of observation here), you need to transpose your data to conform to that convention.

    You do not need to transpose the scaled data back, because StandardScaler, PCA, and most other scikit-learn classes all follow the same convention (observations in rows); see the sketch below.
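
For concreteness, here is a minimal sketch of that workflow on a made-up genes-by-samples table (the gene/sample names and the DataFrame df below are illustrative, not taken from your data): transpose once so samples are rows, scale, and pass the scaled array straight to PCA without transposing back.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Toy genes-x-samples table: 4 genes (rows), 3 samples (columns).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(4, 3)),
    index=["gene1", "gene2", "gene3", "gene4"],
    columns=["sample1", "sample2", "sample3"],
)

# Transpose so samples are rows (scikit-learn's convention), then standardize each gene.
X_scaled = StandardScaler().fit_transform(df.T)

# Feed the scaled matrix to PCA as-is: samples stay in rows.
pca = PCA()
pca_scores = pca.fit_transform(X_scaled)
print(pca_scores.shape)   # (3, n_components): one row of PC scores per sample

With this layout, pca_scores has one row per sample, and pca.components_ has one row per principal component expressed over the genes.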