Tags: python, dataframe, normalization

How is DataFrame normalization done?


I am trying to understand how DataFrame values are normalized. Here is a scenario from the famous Titanic disaster dataset, along with the code and the result of a query:

dftitanic.groupby('Fsize')['Survived'].value_counts(normalize=False).reset_index(name='perc')

Result:

    Fsize  Survived  perc
0       1         0   374
1       1         1   163
2       2         1    89
3       2         0    72
4       3         1    59
5       3         0    43
6       4         1    21
7       4         0     8
8       5         0    12
9       5         1     3
10      6         0    19
11      6         1     3
12      7         0     8
13      7         1     4
14      8         0     6
15     11         0     7


And if I use .value_counts(normalize=True), the result would be:

dftitanic.groupby('Fsize')['Survived'].value_counts(normalize=True).reset_index(name='perc')
    Fsize  Survived      perc
0       1         0  0.696462
1       1         1  0.303538
2       2         1  0.552795
3       2         0  0.447205
4       3         1  0.578431
5       3         0  0.421569
6       4         1  0.724138
7       4         0  0.275862
8       5         0  0.800000
9       5         1  0.200000
10      6         0  0.863636
11      6         1  0.136364
12      7         0  0.666667
13      7         1  0.333333
14      8         0  1.000000
15     11         0  1.000000

And the data from describe():

        Fsize   Survived    perc
count   16.0000 16.000000   16.000000
mean    4.6875  0.437500    55.687500
std     2.7500  0.512348    95.378347
min     1.0000  0.000000    3.000000
25%     2.7500  0.000000    6.750000
50%     4.5000  0.000000    15.500000
75%     6.2500  1.000000    62.250000
max     11.0000 1.000000    374.000000

My effort:

From https://stackoverflow.com/a/41532180, I got the following methods:

  1. normalized_df=(df-df.mean())/df.std()

  2. normalized_df=(df-df.min())/(df.max()-df.min())

However, judging from the results of describe(), the above two methods do not match the results of .value_counts(normalize=True).
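To see why the mismatch is expected, here is a quick sketch (with made-up numbers, not the actual Titanic counts) of what the two formulas actually produce when applied to a column of counts:

```python
import pandas as pd

# A toy column of counts, standing in for the 'perc' column above
# (hypothetical values, for illustration only).
perc = pd.Series([374, 163, 89, 72])

# Method 1: standardization -- result has mean 0 and std 1,
# and can contain negative values.
method1 = (perc - perc.mean()) / perc.std()

# Method 2: min-max scaling -- result lies in [0, 1],
# but the values do not sum to 1.
method2 = (perc - perc.min()) / (perc.max() - perc.min())

# Neither operation computes within-group proportions, which is
# what value_counts(normalize=True) returns.
```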

A similar formula and description is present elsewhere, but it did not give me understandable results either.

Question:

How is this normalization being done, i.e. what exactly does .value_counts(normalize=True) compute?


Solution

  • In the context of a pandas groupby operation, the normalize parameter of value_counts, when set to True, converts raw counts into proportions: within each group, every count is divided by that group's total, so the values for each group sum to 1 (multiply by 100 to read them as percentages).
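The within-group arithmetic can be sketched on a tiny, hypothetical version of the data (the values below are made up for illustration, not taken from the Titanic dataset):

```python
import pandas as pd

# Hypothetical mini-dataset in the same shape as the Titanic example:
# Fsize = family size, Survived = 0/1.
df = pd.DataFrame({
    "Fsize":    [1, 1, 1, 2, 2, 2, 2],
    "Survived": [0, 0, 1, 1, 1, 0, 1],
})

counts = df.groupby("Fsize")["Survived"].value_counts(normalize=False)
fracs = df.groupby("Fsize")["Survived"].value_counts(normalize=True)

# normalize=True simply divides each count by its group's total,
# so the proportions within every Fsize group sum to 1.
manual = counts / counts.groupby(level="Fsize").transform("sum")
```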

    Regarding the article on normalization in machine learning, it is essential to differentiate between the 'normalize' parameter in Pandas and normalization techniques in machine learning. In Pandas, 'normalize' is specific to calculating percentages within groups, while normalization in machine learning refers to scaling features to a range, often between 0 and 1, to ensure uniformity and prevent certain features from dominating the model due to their scale. Standardization, on the other hand, involves transforming data to have a mean of 0 and a standard deviation of 1, aiding in comparison and interpretation of different features.

    In summary, the article you provided shows the mathematical intuition behind Standardization and Normalization as used in machine learning, and that is different from the pandas normalize parameter you are using. Those two transformations belong to data preprocessing: they are applied before the data is fed into an algorithm so that features on different scales become comparable and the model can produce more accurate results. Both techniques are available in the scikit-learn (sklearn) library: from sklearn.preprocessing import StandardScaler, MinMaxScaler.
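As a minimal sketch of what those two scalers compute (written here with plain NumPy rather than sklearn itself; note that StandardScaler divides by the population standard deviation, ddof=0, whereas pandas' .std() defaults to the sample standard deviation, ddof=1):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# StandardScaler equivalent: subtract the mean, divide by the
# population standard deviation (ddof=0).
standardized = (x - x.mean()) / x.std(ddof=0)

# MinMaxScaler equivalent: rescale the values into [0, 1].
scaled = (x - x.min()) / (x.max() - x.min())
```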

    If you still have doubt feel free to reach out.