Tags: python, loss-function, least-squares, absolute-value, norm

Interpreting the effect of the Lk norm with different orders when training a machine learning model in the presence of outliers


Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms, are possible; more generally, computing the size or length of a vector is often required, either directly or as part of a broader vector or vector-matrix operation.
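For example, both metrics can be written in terms of norms of the residual vector. A minimal NumPy sketch (the array values here are just an illustration):

import numpy as np

residuals = np.array([1.0, -2.0, 0.5, 3.0])  # predictions minus targets

mae = np.mean(np.abs(residuals))        # L1-based: mean absolute error
rmse = np.sqrt(np.mean(residuals ** 2)) # L2-based: root mean squared error

# The same values expressed through np.linalg.norm:
n = len(residuals)
assert np.isclose(mae, np.linalg.norm(residuals, 1) / n)
assert np.isclose(rmse, np.linalg.norm(residuals, 2) / np.sqrt(n))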

Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer another function. For instance, if there are many outlier instances in the dataset, you may consider using the mean absolute error (MAE) instead.

More formally, the higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. (Source: *Hands-On Machine Learning with Scikit-Learn and TensorFlow*.)

Therefore, for any dataset with a large number of outliers, the loss function, i.e., the norm of the vector of absolute differences between predictions and true labels (similar to y_diff in the code below), should grow as we increase the order of the norm. In other words, the RMSE should be greater than the MAE. Correct me if I'm mistaken.
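A quick numeric check of that expectation, using a hypothetical residual vector with one large outlier:

import numpy as np

residuals = np.array([1.0, 1.0, 1.0, 8.0])  # one large outlier

mae = np.mean(np.abs(residuals))         # 2.75
rmse = np.sqrt(np.mean(residuals ** 2))  # about 4.09

print(mae, rmse)  # RMSE > MAE, and the gap widens as the outlier grows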

Given this definition, I generated a random dataset and added many outliers to it, as seen in the code below. I then computed the Lk norm of the residuals (y_diff) for k values ranging from 1 to 5. However, I found that the Lk norm decreases as k increases, whereas I expected the RMSE (norm of order 2) to be greater than the MAE (norm of order 1).

Why does the Lk norm decrease as we increase k (the order), contrary to the definition above?

Code:

import numpy as np
import plotly.offline as pyo
import plotly.graph_objs as go
from plotly import tools

num_points = 1000
num_outliers = 50

x = np.linspace(0, 10, num_points)

# places where to add outliers:
outlier_locs = np.random.choice(len(x), size=num_outliers, replace=False)
outlier_vals = np.random.normal(loc=1, scale=5, size=num_outliers)

y_true = 2 * x
y_pred = 2 * x + np.random.normal(size=num_points)
y_pred[outlier_locs] += outlier_vals

y_diff = y_true - y_pred

losses_given_lk = []
norms = np.linspace(1, 5, 50)

for k in norms:
    # np.linalg.norm(v, k) computes the vector k-norm: (sum(|v_i| ** k)) ** (1 / k)
    losses_given_lk.append(np.linalg.norm(y_diff, k))

trace_1 = go.Scatter(x=norms, 
                     y=losses_given_lk, 
                     mode="markers+lines", 
                     name="lk_norm")

trace_2 = go.Scatter(x=x, 
                     y=y_true, 
                     mode="lines", 
                     name="y_true")

trace_3 = go.Scatter(x=x, 
                     y=y_pred, 
                     mode="markers", 
                     name="y_true + noise")

# note: in Plotly 4+, make_subplots lives in plotly.subplots rather than plotly.tools
fig = tools.make_subplots(rows=1, cols=3, subplot_titles=("lk_norms", "y_true", "y_true + noise"))
fig.append_trace(trace_1, 1, 1)
fig.append_trace(trace_2, 1, 2)
fig.append_trace(trace_3, 1, 3)

pyo.plot(fig, filename="lk_norms.html")

Output:

[figure: three subplots — lk_norms (decreasing in k), y_true, and y_true + noise]

Finally, in which cases would one use the L3 or L4 norm, etc.?


Solution

  • An equivalent Python implementation of np.linalg.norm (for vectors) is:

    def my_norm(array, k):
        # vector k-norm: (sum of |x_i| ** k) ** (1 / k)
        return np.sum(np.abs(array) ** k) ** (1 / k)
    

    To test our function, run the following:

    array = np.random.randn(10)
    print(np.linalg.norm(array, 1), np.linalg.norm(array, 2), np.linalg.norm(array, 3), np.linalg.norm(array, 10))
    # And
    print(my_norm(array, 1), my_norm(array, 2), my_norm(array, 3), my_norm(array, 10))
    

    output:

    (9.561258110585216, 3.4545982749318846, 2.5946495606046547, 2.027258231324604)
    (9.561258110585216, 3.454598274931884, 2.5946495606046547, 2.027258231324604)
    

    Therefore, we can see that the numbers are decreasing, matching the output in the figure posted in the question above.

    However, the RMSE is defined with a mean, not a sum: in Python, np.sqrt(np.mean(array ** 2)), which generalizes to np.mean(np.abs(array) ** k) ** (1 / k) for order k. As a result, I replaced the sum with the mean, and this is exactly what resolves the question: the sum-based norm is non-increasing in k, while the mean-based version (the power mean) is non-decreasing in k.
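    The two quantities differ only by a k-dependent factor (a standard identity, writing n for the vector length):

    \|x\|_k = \Big(\sum_{i=1}^{n} |x_i|^k\Big)^{1/k}, \qquad
    M_k(x) = \Big(\tfrac{1}{n}\sum_{i=1}^{n} |x_i|^k\Big)^{1/k} = n^{-1/k}\,\|x\|_k

    By the power-mean inequality, M_k(x) is non-decreasing in k, while \|x\|_k is non-increasing in k, so the two curves move in opposite directions as the order grows.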

    Therefore, if I add the following function:

    def my_norm_v2(array, k):
        # power mean of order k: (mean of |x_i| ** k) ** (1 / k); equals the RMSE when k = 2
        return np.mean(np.abs(array) ** k) ** (1 / k)
    

    And run the following:

    print(my_norm_v2(array, 1), my_norm_v2(array, 2), my_norm_v2(array, 3), my_norm_v2(array, 10))
    

    Output:

    (0.9561258110585216, 1.092439894967332, 1.2043296427640868, 1.610308452218342)
    

    Hence, the numbers are increasing.

    In the code below, I re-ran the same code posted in the question above with the modified implementation (note the added my_norm_v2 definition and its use in the loop), and I got the following:

    import numpy as np
    import plotly.offline as pyo
    import plotly.graph_objs as go
    from plotly import tools
    
    def my_norm_v2(array, k):
        # mean-based Lk loss (power mean); equals the RMSE when k = 2
        return np.mean(np.abs(array) ** k) ** (1 / k)
    
    num_points = 1000
    num_outliers = 50
    
    x = np.linspace(0, 10, num_points)
    
    # places where to add outliers:
    outlier_locs = np.random.choice(len(x), size=num_outliers, replace=False)
    outlier_vals = np.random.normal(loc=1, scale=5, size=num_outliers)
    
    y_true = 2 * x
    y_pred = 2 * x + np.random.normal(size=num_points)
    y_pred[outlier_locs] += outlier_vals
    
    y_diff = y_true - y_pred
    
    losses_given_lk = []
    losses = []
    norms = np.linspace(1, 5, 50)
    
    for k in norms:
        losses_given_lk.append(np.linalg.norm(y_diff, k))
        losses.append(my_norm_v2(y_diff, k))  # mean-based version: increases with k
    
    trace_1 = go.Scatter(x=norms, 
                         y=losses_given_lk, 
                         mode="markers+lines", 
                         name="lk_norm")
    
    trace_2 = go.Scatter(x=norms, 
                         y=losses, 
                         mode="markers+lines", 
                         name="my_lk_norm")
    
    trace_3 = go.Scatter(x=x, 
                         y=y_true, 
                         mode="lines", 
                         name="y_true")
    
    trace_4 = go.Scatter(x=x, 
                         y=y_pred, 
                         mode="markers", 
                         name="y_true + noise")
    
    fig = tools.make_subplots(rows=1, cols=4, subplot_titles=("lk_norms", "my_lk_norms", "y_true", "y_true + noise"))
    fig.append_trace(trace_1, 1, 1)
    fig.append_trace(trace_2, 1, 2)
    fig.append_trace(trace_3, 1, 3)
    fig.append_trace(trace_4, 1, 4)
    
    pyo.plot(fig, filename="lk_norms.html")
    

    Output:

    [figure: four subplots — lk_norms decreasing in k, my_lk_norms increasing in k, y_true, and y_true + noise]

    And that explains why the loss increases as we increase k.
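    As for when a higher-order loss might be used: the larger k is, the more the biggest residual dominates the power mean, so higher orders penalize outliers more heavily. Below is a small illustrative sketch (not from the original post) comparing a clean residual vector against one containing a single outlier:

    import numpy as np

    def power_mean(residuals, k):
        # mean-based Lk loss: (mean(|r| ** k)) ** (1 / k)
        return np.mean(np.abs(residuals) ** k) ** (1 / k)

    clean = np.array([1.0, 1.0, 1.0, 1.0])
    with_outlier = np.array([1.0, 1.0, 1.0, 10.0])

    for k in (1, 2, 3, 4):
        print(k, power_mean(clean, k), power_mean(with_outlier, k))

    # As k grows, the loss on `with_outlier` climbs toward the largest residual (10)
    # while the clean vector stays at 1.0: higher orders focus on the worst errors,
    # which is why MAE (k = 1) is preferred when outliers should not dominate training.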