java, apache-spark, linear-regression, apache-spark-ml

How to make predictions with Linear Regression Model?


I am currently working on a linear regression project where I need to gather data, fit a model to it, and then make predictions on test data.

If I understand correctly, simple linear regression works with two variables, X (independent) and Y (dependent). I have the following Dataset, where I consider the minute column to be X and the value column to be Y:

+-----+------+
|value|minute|
+-----+------+
| 5000|   672|
| 6000|   673|
| 7000|   676|
| 8000|   678|
| 9000|   680|
+-----+------+

What I don't know is how to fit this Dataset correctly into a Linear Regression Model. I've worked with k-means before and what I did with it was create a features column in vector form. I did the same with this dataset:

VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"minute", "value"})
                .setOutputCol("features");

Dataset<Row> vectorData = assembler.transform(dataset);

I then fit this into a linear regression model:

LinearRegression lr = new LinearRegression();
LinearRegressionModel model = lr.fit(vectorData);

This is the part where I get stuck. How can I make predictions with this model? I want to estimate value for an arbitrary minute, e.g. 700.

How can I do that? How can I find a prediction/estimate of my Y value based on a random X value?

EDIT: Does the linear regression model differentiate between the dependent and independent variables? How?


Solution

  • Thanks to the feedback from @RickMoritz and @JacekLaskowski, I was able to figure out the solution:

    LinearRegression does indeed distinguish between X and Y. The X column is the features column and the Y column is the label column. Unless you tell it otherwise, it looks for columns named features and label.
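To make the roles of X and Y concrete, here is a minimal plain-Java sketch of closed-form simple least squares on the five rows from the question. It is not Spark's implementation, and the class and method names (SimpleLeastSquares, fit, predict) are hypothetical, chosen only for illustration. It shows that only Y (value) is being explained and only X (minute) is fed in at prediction time.

```java
// Plain-Java sketch of simple least squares. NOT Spark's implementation;
// SimpleLeastSquares, fit and predict are hypothetical names for illustration.
public class SimpleLeastSquares {

    /** Returns {slope, intercept} of the fitted line y = slope * x + intercept. */
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Covariance of X and Y over the variance of X gives the slope.
        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[]{slope, intercept};
    }

    /** Evaluates the fitted line at a new X value. */
    public static double predict(double[] line, double x) {
        return line[0] * x + line[1];
    }

    public static void main(String[] args) {
        double[] minute = {672, 673, 676, 678, 680};      // X: the feature
        double[] value = {5000, 6000, 7000, 8000, 9000};  // Y: the label
        double[] line = fit(minute, value);
        System.out.println("slope = " + line[0]);
        System.out.println("prediction at minute 700 = " + predict(line, 700));
    }
}
```

For this data the slope works out to 468.75 and the estimate at minute 700 to about 18343.75. Note that 700 lies well outside the observed range of 672 to 680, so this is an extrapolation.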

    So before fitting a LinearRegression model to your dataset, make sure to state your label and features columns. You can set the label column when you define your LinearRegression (with the dataset above, Y is the value column):

    LinearRegression lr = new LinearRegression().setLabelCol("value");

    For the features column, first convert your X column into a vector column, for example with a VectorAssembler whose input columns contain only X (minute here, not value), and then set it the same way on the same estimator:

    lr.setFeaturesCol("features");

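    Once you've done that, you're all set. To get a prediction, convert your X value into a vector and pass it to the predict() method of the fitted LinearRegressionModel, e.g. model.predict(Vectors.dense(700.0)). As far as I know, predict() is public from Spark 3.0 onward; on older releases you can instead call model.transform() on a Dataset that has a features column and read the resulting prediction column.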
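Putting the steps together, an end-to-end version might look like the sketch below. It makes several assumptions that are not in the question: a local SparkSession, Spark 3.0 or later (so that predict(Vector) is callable), both columns stored as doubles, and a hypothetical class name MinuteValueRegression with a helper fitModel. Treat it as an illustration rather than a definitive implementation.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical class name; assumes Spark 3.0+ on the classpath.
public class MinuteValueRegression {

    /** Builds the five-row dataset from the question and fits value ~ minute. */
    public static LinearRegressionModel fitModel(SparkSession spark) {
        StructType schema = DataTypes.createStructType(new StructField[]{
                DataTypes.createStructField("value", DataTypes.DoubleType, false),
                DataTypes.createStructField("minute", DataTypes.DoubleType, false)
        });
        List<Row> rows = Arrays.asList(
                RowFactory.create(5000.0, 672.0),
                RowFactory.create(6000.0, 673.0),
                RowFactory.create(7000.0, 676.0),
                RowFactory.create(8000.0, 678.0),
                RowFactory.create(9000.0, 680.0));
        Dataset<Row> dataset = spark.createDataFrame(rows, schema);

        // Only X (minute) goes into the features vector; the label stays out.
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"minute"})
                .setOutputCol("features");
        Dataset<Row> training = assembler.transform(dataset);

        // Tell the estimator which column is X (features) and which is Y (label).
        LinearRegression lr = new LinearRegression()
                .setFeaturesCol("features")
                .setLabelCol("value");
        return lr.fit(training);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("MinuteValueRegression")
                .master("local[*]")  // assumption: run locally
                .getOrCreate();
        try {
            LinearRegressionModel model = fitModel(spark);
            // Wrap the single X value in a vector and ask for the estimated Y.
            double estimate = model.predict(Vectors.dense(700.0));
            System.out.println("Estimated value at minute 700: " + estimate);
        } finally {
            spark.stop();
        }
    }
}
```

The key difference from the code in the question is that the assembler takes only minute as input. If value is also placed in the features vector, the model is handed the answer as one of its inputs, and fit() has no label column to learn from.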