Tags: python, machine-learning, regression, sklearn-pandas, polynomials

Machine learning for alternate time periods


I have a polynomial regression script that correctly predicts values from X and Y data. In my example I use CPU consumption. Below is a sample of the data set:

[image: sample of the data set]

Complete data set

Here, time represents the collection time, for example:

1 = 1 minute
2 = 2 minutes

And so on...

And consume is the CPU usage for that minute. In short, this data set shows the behavior of a host over a 30-minute period, with each value corresponding to one minute in ascending order (1 min, 2 min, 3 min, ...).

The result for this is:

[image: plot of the polynomial regression result]

With this algorithm:

# -*- coding: utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitting Polynomial Regression to the dataset
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)

# Visualizing the Polynomial Regression results
def viz_polymonial():
    plt.scatter(X, y, color='red')
    # poly_reg is already fitted, so transform() is enough here
    plt.plot(X, pol_reg.predict(poly_reg.transform(X)), color='blue')
    plt.title('Polynomial Regression for CPU')
    plt.xlabel('Time range')
    plt.ylabel('Consume')
    plt.show()
    return
viz_polymonial()

# Predict the consumption at time = 20
print(pol_reg.predict(poly_reg.transform([[20]])))

What's the problem?

If we duplicate this data set so that the 30-minute range appears twice, the algorithm does not seem to handle the data set and the result is not as good. Example of the data set:

[image] --> Up to time = 30
[image] --> Up to time = 30

Complete data set

Note: In this case there are 60 values, where each block of 30 values represents a 30-minute range, as if they were collected on different days.
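
For readers without access to data.csv, here is a minimal sketch of how such a duplicated data set could be built. The CPU values are random placeholders, not the original readings, and the leading index column is an assumption based on the `iloc[:, 1:2]` slicing used in the script. The helper name `one_run` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical CPU readings; the real values live in data.csv, which is not shown.
rng = np.random.default_rng(0)

def one_run():
    """Return one 30-minute collection run with columns `time` and `consume`."""
    return pd.DataFrame({
        "time": np.arange(1, 31),
        "consume": rng.uniform(10, 90, size=30).round(1),
    })

# Two runs stacked one after the other: time goes 1..30 and then 1..30 again.
# reset_index() adds a leading index column (assumed layout of data.csv),
# so the same iloc slicing as in the script still applies.
dataset = pd.concat([one_run(), one_run()], ignore_index=True).reset_index()

X = dataset.iloc[:, 1:2].values  # time, shape (60, 1)
y = dataset.iloc[:, 2].values    # consume, shape (60,)
```

Note that `X` is not sorted: row 30 starts again at time 1, right after the row with time 30.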

The result it shows is this:

[image: plot of the result for the duplicated data set]

Objective: I would like the blue line that represents the polynomial regression to look like the one in the first result image. The one above shows a loop where the points are connected, as if the algorithm had failed.

Research source


Solution

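  • The problem is that in the second case you plot with X = 1, 2, ..., 30, 1, 2, ..., 30. The plot function connects successive points in the order they are given, so the line jumps from X = 30 back to X = 1 and draws the loop. The regression itself is fine: if you plotted the predictions as a scatter using pyplot, you would see your nice regression curve. Alternatively, you can sort the points with argsort before plotting. Here is the code with the scatter in green and the argsort line in black.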

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    
    # Importing the dataset
    dataset = pd.read_csv('data.csv')
    X = dataset.iloc[:, 1:2].values
    y = dataset.iloc[:, 2].values
    
    # Splitting the dataset into the Training set and Test set
    from sklearn.model_selection import train_test_split 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    
    # Fitting Polynomial Regression to the dataset
    from sklearn.preprocessing import PolynomialFeatures
    poly_reg = PolynomialFeatures(degree=4)
    X_poly = poly_reg.fit_transform(X)
    pol_reg = LinearRegression()
    pol_reg.fit(X_poly, y)
    
    # Visualizing the Polynomial Regression results
    def viz_polymonial():
        plt.scatter(X, y, color='red')
        # Sort by X so the line is drawn left to right instead of looping back
        indices = np.argsort(X[:, 0])
        y_pred = pol_reg.predict(poly_reg.transform(X))
        plt.scatter(X, y_pred, color='green')
        plt.plot(X[indices], y_pred[indices], color='black')
        plt.title('Polynomial Regression for CPU')
        plt.xlabel('Time range')
        plt.ylabel('Consume')
        plt.show()
        return
    viz_polymonial()
    
    # Predict the consumption at time = 20
    print(pol_reg.predict(poly_reg.transform([[20]])))
    

    Here is the output image for the larger dataset:

    [image]
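
As a complement to the argsort approach, the sketch below checks that the fitted curve really is the same for both 30-minute runs, and then evaluates it on an evenly spaced, sorted grid. Plotting over such a grid is a different technique from the argsort fix above, not part of the original answer. The data here is synthetic (a sine wave plus noise) and only stands in for data.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for data.csv (an assumption, not the original data):
# two 30-minute runs, so every time value 1..30 appears twice.
rng = np.random.default_rng(0)
time = np.tile(np.arange(1, 31), 2)
consume = 40 + 20 * np.sin(time / 5.0) + rng.normal(0, 2, size=time.size)

X = time.reshape(-1, 1)
y = consume

# Same model as in the question: degree-4 polynomial features + linear regression.
poly_reg = PolynomialFeatures(degree=4)
pol_reg = LinearRegression().fit(poly_reg.fit_transform(X), y)

# Predictions in the original row order: X runs 1..30 and then 1..30 again.
y_pred = pol_reg.predict(poly_reg.transform(X))

# The fitted curve is single-valued: both runs get the same prediction per minute,
# so the loop in the plot comes from the drawing order, not from the model.
same_curve = np.allclose(y_pred[:30], y_pred[30:])

# Evaluate the model on a sorted grid so a line plot only moves left to right.
grid = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
curve = pol_reg.predict(poly_reg.transform(grid))
```

Passing `grid` and `curve` to `plt.plot` should draw one smooth line over the scatter of `X` and `y`, regardless of the order of the rows in the data set.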