For a university project, I have been given data whose features have very different ranges of values and are not normally distributed. I read the sklearn documentation on normalization, which says that normalization is the process of scaling individual samples to have unit norm. sklearn also has Normalization and StandardScaler, which seem to have the same function: scaling the data. But then I read this article on the differences between scaling and normalization, which says that normalization is how you reach a normal distribution and scaling is how you change the range of your data.
Normalization has different meanings depending on the context, and the term is sometimes misleading. I think sklearn uses the terms interchangeably to mean adjusting values measured on different scales to a notionally common scale (e.g., between 0 and 1), rather than changing the data so that they follow a Normal distribution. (StandardScaler gives each feature zero mean and unit variance, but it does not change the shape of the distribution either.)
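As a quick check of the point about distributions, here is a minimal sketch showing that standardizing changes the mean and variance of a feature but not its shape. The exponential toy data and the helper name skewness are assumptions made for illustration; they are not part of sklearn.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def skewness(values):
    """Third standardized moment: 0 for symmetric data, > 0 for a right tail."""
    centered = values - values.mean()
    return (centered ** 3).mean() / centered.std() ** 3


rng = np.random.default_rng(0)
# Hypothetical toy feature: exponential data is strongly right-skewed.
skewed = rng.exponential(scale=2.0, size=(1000, 1))

scaled = StandardScaler().fit_transform(skewed)

# The mean becomes ~0 and the standard deviation ~1 ...
print(scaled.mean(), scaled.std())
# ... but the skewness is the same before and after scaling.
print(skewness(skewed[:, 0]), skewness(scaled[:, 0]))
```

If the goal really is to make a feature look more Gaussian, a different kind of transformation is needed; subtracting the mean and dividing by the standard deviation only shifts and rescales the values.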
From my understanding, in sklearn they differ in what input they work on, how they transform it, and where they can be used.
I assume that by Normalization you mean sklearn.preprocessing.Normalizer.
So, the main difference is that sklearn.preprocessing.Normalizer scales samples to unit norm (vector length), while sklearn.preprocessing.StandardScaler scales features to unit variance after subtracting the mean.
Therefore, the former works on the rows, while the latter works on the columns.
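The row/column distinction can be checked directly. This is only a sketch on a made-up 3x3 array (the values are an assumption chosen so that no row is all zeros and no column is constant):

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical toy data: 3 samples (rows) x 3 features (columns).
X = np.array([[1.0, 2.0, 2.0],
              [4.0, 0.0, 3.0],
              [6.0, 8.0, 0.0]])

rows_scaled = Normalizer(norm='l2').fit_transform(X)
cols_scaled = StandardScaler().fit_transform(X)

# Normalizer: every row (sample) now has Euclidean length 1.
row_lengths = np.linalg.norm(rows_scaled, axis=1)
print(row_lengths)

# StandardScaler: every column (feature) now has mean ~0 and std ~1.
col_means = cols_scaled.mean(axis=0)
col_stds = cols_scaled.std(axis=0)
print(col_means, col_stds)
```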
In particular,
sklearn.preprocessing.normalize "scales input vectors individually to unit norm (vector length)". It can be applied either to rows (by setting the parameter axis to 1) or to features/columns (by setting the parameter axis to 0). It uses one of the following norms: l1, l2, or max, to normalize each non-zero sample (or each non-zero feature if axis is 0).
Note: The term norm here refers to the mathematical definition. See here and here for more information.
sklearn.preprocessing.Normalizer "normalizes samples individually to unit norm". It behaves exactly as sklearn.preprocessing.normalize when axis=1. Unlike normalize, Normalizer performs normalization using the Transformer API (e.g. as part of a preprocessing sklearn.pipeline.Pipeline).
sklearn.preprocessing.StandardScaler "standardizes features by removing the mean and scaling to unit variance". It does not use the norm of a vector; rather, it computes the z-score for each feature.
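To make the three descriptions above concrete, the sketch below recomputes the norms and the z-score by hand with NumPy and compares them with what sklearn returns. The array X is a made-up example with positive values only, and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import normalize, Normalizer, StandardScaler

# Hypothetical toy data with positive entries only.
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 1.0]])

# Row-wise norms computed by hand.
l1 = np.abs(X).sum(axis=1, keepdims=True)          # sum of absolute values
l2 = np.sqrt((X ** 2).sum(axis=1, keepdims=True))  # Euclidean length
mx = np.abs(X).max(axis=1, keepdims=True)          # largest absolute value

# normalize divides each row by the chosen norm.
same_l1 = np.allclose(normalize(X, norm='l1', axis=1), X / l1)
same_l2 = np.allclose(normalize(X, norm='l2', axis=1), X / l2)
same_max = np.allclose(normalize(X, norm='max', axis=1), X / mx)

# Normalizer gives the same result as normalize with axis=1.
same_transformer = np.allclose(Normalizer(norm='l2').fit_transform(X),
                               normalize(X, norm='l2', axis=1))

# z-score per column; StandardScaler uses the population std (ddof=0).
z = (X - X.mean(axis=0)) / X.std(axis=0)
same_z = np.allclose(StandardScaler().fit_transform(X), z)

print(same_l1, same_l2, same_max, same_transformer, same_z)
```

Note that the norms are computed per row and need nothing from the other rows, whereas the z-score needs the mean and standard deviation of the whole column, which is why StandardScaler has to be fitted on training data first.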
This interesting article explores the differences among them in more detail.
Let's use norm='max' for convenience:
from sklearn.preprocessing import normalize, Normalizer, StandardScaler
X = [[1, 2],
[2, 4]]
# Normalize each column by the maximum of that column (x/max(column))
normalize(X, norm='max', axis=0)
# Normalize each row by the maximum of that row (x/max(row))
normalize(X, norm='max', axis=1)
# Normalize with Normalizer (only rows)
Normalizer(norm='max').fit_transform(X)
# Standardize with StandardScaler (only columns)
StandardScaler().fit_transform(X)
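from sklearn.pipeline import Pipeline
# normalize is a plain function, not a transformer, so it cannot be used as a step:
# pipe = Pipeline([('normalization_step', normalize)])  # NOT POSSIBLE
pipe = Pipeline([('normalization_step', Normalizer())])  # POSSIBLE
pipe = Pipeline([('normalization_step', StandardScaler())])  # POSSIBLE
pipe.fit_transform(X)  # pipe.score(X, y) would also need a final estimator step and a target y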
The aforementioned lines of code would transform the data as follows:
# Normalize with normalize, axis=0 (columns)
[[0.5, 0.5],
 [1. , 1. ]]
# Normalize with normalize, axis=1 (rows)
[[0.5, 1. ],
 [0.5, 1. ]]
# Normalize with Normalizer (rows)
[[0.5, 1. ],
 [0.5, 1. ]]
# Standardize with StandardScaler (columns)
[[-1., -1.],
 [ 1.,  1.]]
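For completeness, here is a sketch of a pipeline that can actually be fitted and scored. The toy data, the target y, the step names, and the choice of LogisticRegression as the final estimator are all assumptions for illustration. It also shows one possible way to use the normalize function inside a pipeline, by wrapping it in sklearn.preprocessing.FunctionTransformer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, Normalizer, normalize

# Hypothetical toy data and labels, just to have something to fit.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 1.0], [4.0, 0.5]])
y = np.array([0, 0, 1, 1])

# A pipeline needs a final estimator before score() can be called.
pipe = Pipeline([
    ('normalization_step', Normalizer(norm='max')),
    ('classifier', LogisticRegression()),
])
pipe.fit(X, y)
accuracy = pipe.score(X, y)  # mean accuracy on the training data
print(accuracy)

# The plain function can be turned into a transformer by wrapping it.
wrapped = FunctionTransformer(normalize, kw_args={'norm': 'max', 'axis': 1})
same_as_normalizer = np.allclose(wrapped.fit_transform(X),
                                 Normalizer(norm='max').fit_transform(X))
print(same_as_normalizer)
```

In practice Normalizer is the simpler choice for row-wise normalization; the wrapper is mainly useful if you need normalize with axis=0 inside a pipeline.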