Tags: python, regression, non-linear-regression

Regression with a small dataset


We examined a piece of software that was supposedly used for cracking. We discovered that its running time depends strongly on the input length N, especially when N is greater than 10-15. During our tests, we recorded the following running times.

N = 2 - 16.38 seconds 
N = 5 - 16.38 seconds 
N = 10 - 16.44 seconds 
N = 15 - 18.39 seconds 
N = 20 - 64.22 seconds 
N = 30 - 65774.62 seconds

Task: Find the program's running times for the following three cases: N = 25, N = 40 and N = 50.

I tried polynomial regression, but the predictions varied widely depending on the degree (2, 3, ...).

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Importing the dataset: input lengths N and measured times in seconds
X = np.array([[2],[5],[10],[15],[20],[30]])
X_predict = np.array([[25], [40], [50]])
y = np.array([[16.38],[16.38],[16.44],[18.39],[64.22],[65774.62]])
#y = np.array([[16.38/60],[16.38/60],[16.44/60],[18.39/60],[64.22/60],[65774.62/60]])

# Fitting Polynomial Regression to the dataset
# (the degree was changed between runs, e.g. 2 and 5)
poly = PolynomialFeatures(degree = 11)
X_poly = poly.fit_transform(X)  # fit the feature expansion once, on the training inputs

lin2 = LinearRegression()
lin2.fit(X_poly, y)

# Visualising the Polynomial Regression results
plt.scatter(X, y, color = 'blue')
# Reuse the already-fitted transformer; transform() is enough here
plt.plot(X, lin2.predict(poly.transform(X)), color = 'red')
plt.title('Polynomial Regression')
plt.show()

# Predicting a new result with Polynomial Regression
lin2.predict(poly.transform(X_predict))

For degree 2 the results were

array([[ 32067.76147835],
       [150765.87808383],
       [274174.84800471]])

For degree 5 the results were

array([[  10934.83739791],
       [ 621503.86217946],
       [2821409.3915933 ]])

Solution

  • After an equation search, I was able to fit the data to the equation "seconds = a * exp(b * N) + Offset" with fitted parameters a = 2.5066753490350954E-05, b = 7.2292352155213369E-01, and Offset = 1.6562196782144639E+01, giving RMSE = 0.2542 and R-squared = 0.99999. This combination of data and equation is extremely sensitive to the initial parameter estimates. As you can see, it should interpolate with high accuracy within the data range. Since the equation is simple, it will likely extrapolate well outside the data range. As I understand your description, if different computer hardware is used or the cracking algorithm is parallelized, this solution would no longer match.

    [Image: plot of the fitted curve against the measured data]
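
As a minimal sketch of how the fitted equation above could be used, the snippet below plugs the reported parameters into "seconds = a * exp(b * N) + Offset", recomputes the RMSE on the six measurements, and evaluates N = 25, 40 and 50. The names DATA, predict_seconds and rmse are made up for this illustration and do not come from the original fit. It only evaluates the reported parameters and does not repeat the equation search.

```python
import math

# Timings from the question as (input length N, seconds) pairs.
# DATA, predict_seconds and rmse are illustrative names, not from the original fit.
DATA = [(2, 16.38), (5, 16.38), (10, 16.44),
        (15, 18.39), (20, 64.22), (30, 65774.62)]

# Fitted parameters exactly as reported in the answer.
A = 2.5066753490350954e-05
B = 7.2292352155213369e-01
OFFSET = 1.6562196782144639e+01


def predict_seconds(n):
    """Evaluate seconds = a * exp(b * N) + Offset for input length n."""
    return A * math.exp(B * n) + OFFSET


def rmse(data):
    """Root-mean-square error of the model over (N, seconds) pairs."""
    squared = [(seconds - predict_seconds(n)) ** 2 for n, seconds in data]
    return math.sqrt(sum(squared) / len(squared))


print(f"RMSE on the six measurements: {rmse(DATA):.4f}")
for n in (25, 40, 50):
    print(f"N = {n}: {predict_seconds(n):.4g} seconds")
```

With these parameters the estimates come out at roughly half an hour for N = 25, a few years for N = 40, and a few thousand years for N = 50. N = 40 and N = 50 lie well outside the measured range, so those two figures are best read as rough estimates. To refit the parameters rather than reuse them, scipy.optimize.curve_fit accepts starting values through its p0 argument, which matters here given the sensitivity to initial estimates mentioned above.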