Tags: machine-learning, scikit-learn, regression, random-forest, logistic-regression

Machine learning models don't work with continuous data


I'm attempting to get a machine learning model to predict a baseball player's Batting Average based on their At Bats and Hits. Since:

Batting Average = Hits/At Bats

I would think this relationship would be relatively easy to discover. However, since Batting Average is a float (e.g. 0.300), all the models I try return the following error:

ValueError: Unknown label type: 'continuous'

I'm using sklearn's models. I've tried LogisticRegression, RandomForestClassifier, LinearRegression. They all have the same problem.

From reading other StackOverflow posts on this error, I began doing this:

lab_enc = preprocessing.LabelEncoder()
y = pd.DataFrame(data=lab_enc.fit_transform(y))

This seems to change values such as 0.227 to 136, which seems odd to me, probably because I don't quite understand what the transform is doing. If possible, I would prefer to just use the actual Batting Average values.

Is there a way to get the models I tried to work when predicting continuous values?


Solution

  • The problem you are trying to solve falls into the regression (i.e. numeric prediction) context, and it can certainly be dealt with by ML algorithms.

    I'm using sklearn's models. I've tried LogisticRegression, RandomForestClassifier, LinearRegression. They all have the same problem.

    The first two algorithms you mention here (Logistic Regression and Random Forest Classifier) are for classification problems, and thus are not suitable for your (regression) setting; as expected, they produce the error you mention. Linear Regression, however, is suitable, and it should work fine here.

    For starters, please stick to Linear Regression in order to convince yourself that it can indeed handle the problem; you can subsequently extend to other scikit-learn regression algorithms such as RandomForestRegressor. If you face any issues, open a new question with the specific code & error(s)...
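
As a rough illustration of the point above, here is a minimal sketch that fits two regressors directly on float targets, with no label encoding. The data is synthetic and the variable names (at_bats, hits, scores) are assumptions made for the example, since the question's actual dataset is not shown. The ratio Hits/At Bats is not a linear function of the two inputs, so Linear Regression can only approximate it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the asker's dataset).
rng = np.random.default_rng(0)
at_bats = rng.integers(100, 600, size=500)
hits = rng.binomial(at_bats, 0.27)       # hits can never exceed at bats
X = np.column_stack([at_bats, hits])
y = hits / at_bats                        # continuous float target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A classifier rejects the float target with a ValueError.
classifier_failed = False
try:
    LogisticRegression().fit(X_train, y_train)
except ValueError as err:
    classifier_failed = True
    print("LogisticRegression:", err)

# Regressors accept the float target as-is.
scores = {}
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)  # R^2 on held-out data
    print(type(model).__name__, "R^2:", round(scores[type(model).__name__], 3))
```

LabelEncoder is meant for categorical targets: it replaces each distinct value with an integer class index, which is why 0.227 became 136 and why it is not needed in a regression setting.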