Tags: scikit-learn, data-science, logistic-regression, prediction

why is model.predict(...) always returning the same answer?


I'm trying to use scikit-learn to predict a financial benefit estimate for clients, based on answers they give us and on our historical client projects.

My dataset looks like this:

 # Data (1-15 of 470)
 array(
    [[8662824,       34],
    [ 7978337,       25],
    [  902219,       28],
    [29890885,       64],
    [14357494,       60],
    [ 6403602,       43],
    [96538844,      372],
    [ 7675132,       67],
    [34807493,       78],
    [46215428,       75],
    [ 5437889,       20],
    [16674835,       50],
    [17382472,       20],
    [ 5437889,       20],
    [  313111,        0]])

 # Targets (1-15 of 470)
 array([2739267,   20539,   18304,   16052,   25391,   19444,   61550,
      94392,   75934,   52997,   67485,   92263,   37672, 6748523,
      20710])

The actual data and targets each have 470 rows.

I'm using:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    data,
    targets,
    test_size=.25,
    random_state=42
)
model = LogisticRegression(max_iter=5000)  # 5000 until I learn how to scale
model.fit(x_train, y_train)

# If I run model.predict(...), I get 30000, no matter what
model.predict([[50000, 50]])

Here's some actual shell output (note the score as well):

In [134]: model.predict([[16000000, 5]])
Out[134]: array([30000])

In [135]: model.predict([[150000, 20]])
Out[135]: array([30000])

In [138]: model.predict(np.array([[21500000000000, 2]]))
Out[138]: array([30000])

In [139]: model.predict(np.array([[21500000000000, -444444]]))
Out[139]: array([30000])

In [140]: model.predict([[2150000, 250]])
Out[140]: array([30000])

In [141]: model.score(x_test, y_test)
Out[141]: 0.009345794392523364

In [144]: model.n_iter_
Out[144]: array([4652], dtype=int32)

Here's some metadata from the model (via .__dict__):

{'penalty': 'l2',
 'dual': False,
 'tol': 0.0001,
 'C': 1.0,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'class_weight': None,
 'random_state': None,
 'solver': 'lbfgs',
 'max_iter': 5000,
 'multi_class': 'auto',
 'verbose': 0,
 'warm_start': False,
 'n_jobs': None,
 'l1_ratio': None,
 'n_features_in_': 2,
 ...

There's definitely more of a relationship between the two features and the target than a score of 0.0093 would suggest; after all, we currently use the same data to make these estimates by hand. What am I doing wrong? And under what circumstances would it be normal for a trained model to always return the same answer?


Solution

Your target is a continuous variable, so you need a regression model. Despite its name, LogisticRegression is a classifier: it treats every distinct value in your targets as a separate class label. With 470 rows of mostly unique target values, you get hundreds of classes with one or two examples each, so the model collapses to predicting a single label (here, 30000) for any input. That also explains the score: for a classifier, .score is accuracy, and matching one label out of hundreds of classes yields roughly 0.009.

  • For a simple regression model, use LinearRegression or a DecisionTreeRegressor. If you want a more powerful model, try a RandomForestRegressor or gradient boosting.
  • If you use linear regression, don't forget to scale your features with a StandardScaler or a RobustScaler; your first feature is several orders of magnitude larger than the second.
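As a minimal sketch of the suggested fix (using synthetic placeholder data shaped like the question's, since the real dataset isn't shown), scaling and linear regression can be combined into one estimator with a scikit-learn pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: a large monetary-style figure and a small count,
# roughly matching the scales in the question's array.
rng = np.random.default_rng(42)
amounts = rng.integers(100_000, 100_000_000, size=(470, 1)).astype(float)
counts = rng.integers(0, 400, size=(470, 1)).astype(float)
X = np.hstack([amounts, counts])

# Synthetic continuous target loosely tied to both features.
y = 0.002 * X[:, 0] + 100.0 * X[:, 1] + rng.normal(0, 5000, size=470)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# StandardScaler + LinearRegression in one pipeline:
# predict() automatically scales inputs before regressing.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(x_train, y_train)

# Predictions now vary with the inputs instead of returning one constant.
print(model.predict([[16_000_000, 5]]))
print(model.predict([[150_000, 20]]))

# For a regressor, .score is R^2 on the test set, not classification accuracy.
print(model.score(x_test, y_test))
```

No max_iter is needed here: ordinary least squares has a closed-form solution, so the convergence tuning from the original LogisticRegression setup disappears entirely.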