python machine-learning regression sklearn-pandas polynomials

Machine learning for alternate time periods

I have a polynomial regression script that correctly predicts values from X and Y data. In my example I use CPU consumption. Below is an example of the data set:

Complete data set

Where time represents the collection time, example:

1 = 1 minute
2 = 2 minutes

And so on...

And consume is the CPU usage value for that minute. In summary, this data set shows the behavior of a host over a period of 30 minutes, with each value corresponding to one minute in ascending order (1 min, 2 min, 3 min, ...).

The result for this is:

With this algorithm:

# -*- coding: utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitting Polynomial Regression to the dataset
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)

# Visualizing the Polynomial Regression results
def viz_polynomial():
    plt.scatter(X, y, color='red')
    # poly_reg is already fitted above, so transform() is enough here
    plt.plot(X, pol_reg.predict(poly_reg.transform(X)), color='blue')
    plt.title('Polynomial Regression for CPU')
    plt.xlabel('Time range')
    plt.ylabel('Consume')
    plt.show()
viz_polynomial()

# 20 = time
print(pol_reg.predict(poly_reg.transform([[20]])))

What's the problem?

If we duplicate this data set so that the 30-minute range appears twice, the algorithm does not seem to understand the data set and its result is not as good. Example of the data set:

--> Up to time = 30 --> Up to time = 30

Complete data set

Note: In this case there are 60 values, where each block of 30 values represents a 30-minute range, as if they came from different collection days.
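To make the layout concrete, here is a minimal sketch of how such a 60-row data set is structured. The consume values are made up (the real data.csv is not reproduced here), and the variable names are only illustrative:

```python
import numpy as np

# Hypothetical values: the real data.csv is not reproduced here.
rng = np.random.default_rng(0)
minutes = np.arange(1, 31)                     # 1, 2, ..., 30
time = np.tile(minutes, 2)                     # 1..30 followed by 1..30 again
consume = rng.uniform(10, 90, size=time.size)  # made-up CPU usage per minute

# The time column is no longer increasing: it drops from 30 back to 1
# at the boundary between the two collection periods.
restart = np.flatnonzero(np.diff(time) < 0)
print(len(time), time[28:32], restart)
```

The point of the sketch is only that the time column restarts halfway through, so each time value now appears twice.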

The result it shows is this:

Objective: I would like the blue line that represents the polynomial regression to look like the one in the first result image. The one above shows a loop where the points are connected, as if the algorithm had failed.

Research source

Solution

The problem is that in the second case you plot with X = 1, 2, ..., 30, 1, 2, ..., 30. The plot function connects successive points in the order they are given, so the line jumps from X = 30 back to X = 1. If you plotted the predictions as a scatter with pyplot instead, you would see your nice regression curve. Alternatively, you can sort the points with argsort before plotting. Here is the code with the scatter in green and the argsort line in black.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression

# Importing the dataset
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)

# Visualizing the Polynomial Regression results
def viz_polynomial():
    plt.scatter(X, y, color='red')
    # Predict once and reuse the result for both the scatter and the line
    y_pred = pol_reg.predict(poly_reg.transform(X))
    # Sort by X so that plt.plot connects the points from left to right
    indices = np.argsort(X[:, 0])
    plt.scatter(X, y_pred, color='green')
    plt.plot(X[indices], y_pred[indices], color='black')
    plt.title('Polynomial Regression for CPU')
    plt.xlabel('Time range')
    plt.ylabel('Consume')
    plt.show()
viz_polynomial()

# 20 = time
print(pol_reg.predict(poly_reg.transform([[20]])))

Here is the output image for the larger dataset.
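Another option, instead of sorting the original points, is to evaluate the fitted model on an evenly spaced grid and plot that. The following is only a sketch under assumptions: the helper name smooth_curve, the grid size of 200, and the synthetic data are mine and not part of the original script.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures


def smooth_curve(X, y, degree=4, n_points=200):
    """Fit a polynomial regression and evaluate it on a sorted, evenly spaced grid."""
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X), y)
    # The grid is increasing by construction, so a line plot can connect
    # its points from left to right without doubling back.
    grid = np.linspace(X.min(), X.max(), n_points).reshape(-1, 1)
    return grid, model.predict(poly.transform(grid))


# Synthetic stand-in for the 60-row data set: time runs 1..30 twice.
rng = np.random.default_rng(0)
time = np.tile(np.arange(1, 31), 2)
consume = 50 + 10 * np.sin(time / 5) + rng.normal(0, 2, size=time.size)
X_demo = time.reshape(-1, 1).astype(float)

grid, curve = smooth_curve(X_demo, consume)
```

Calling plt.plot(grid, curve) on the result should then draw a single smooth line, regardless of how the rows in the data set are ordered or how often each time value repeats.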