Tags: analytics, mahout, classification

Use case for incremental supervised learning using Apache Mahout


Business case: Forecasting fuel consumption at site.

Say fuel consumption C depends on various factors x1, x2, ..., xn. Mathematically, C = F(x1, x2, ..., xn), but I do not have an explicit equation for F.

I do have a historical dataset from which I can derive the correlation of C with x1, x2, etc. C, x1, x2, ... are all quantitative. Working out the correlation for an n-variable equation seems tough for someone like me with limited statistical knowledge.

So I was thinking of employing a supervised machine learning technique. I would train a classifier on the historical data to get a prediction for the next consumption value.

Question: Am I thinking about this the right way?

Question: If so, my system should be an evolving one: the more real data I feed it, the more the model evolves to make a better prediction the next time. Is this understanding correct?

If the statements above are true, will the AdaptiveLogisticRegression algorithm in Mahout be of help to me?

Requesting advice from the experts here!

Thanks in advance.


Solution

  • Ok, correlation is not a forecasting model. Correlation only measures the strength of the linear relationship between two datasets, based on their covariance.

    In order to develop a forecasting model, what you need to perform is regression.

    The simplest form of regression is linear regression on a single variable, where C = F(x1). This can easily be done in Excel. However, you state that C is a function of several variables. For this, you can employ multiple linear regression (linear regression on several input variables). There are standard packages that can perform this (within Excel, for example), or you can use Matlab, etc.

    Now, we are assuming that there is a "linear" relationship between C and the components of X (the input vector). If the relationship were not linear, then you would need more sophisticated methods (nonlinear regression), which may very well employ machine learning methods.

    Finally, some series exhibit auto-correlation, meaning that each value depends on the series' own past values. If this is the case, then it may be possible for you to ignore the C = F(x1, x2, x3...xn) relationships, and instead directly model the C series itself using time-series techniques such as ARMA and more complex variants.

    I hope this helps, Srikant Krishna
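
As a rough illustration of the two approaches mentioned in the answer, here is a minimal Python sketch (not part of the original answer, and not Mahout code). It fits a multiple linear regression C = b0 + b1*x1 + b2*x2 by ordinary least squares, then reuses the same solver to fit the simplest autoregressive model, AR(1), where C_t = a + phi*C_{t-1}. The data, the choice of two inputs, and all function names are hypothetical and only serve the example; AR(1) is used here as a simpler stand-in for the ARMA family.

```python
# Illustrative sketch only: data and helper names are hypothetical.

def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: inputs may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(rows, targets):
    """Return [intercept, b1, ..., bn] minimising the squared error."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, row):
    """Evaluate the fitted linear model on one input row."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))


# (1) Multiple linear regression on made-up historical data.
# Each row is (x1, x2); consumption was generated as 5 + 2*x1 + 0.5*x2.
history_x = [(10, 20), (12, 25), (15, 22), (18, 30), (20, 28)]
history_c = [35.0, 41.5, 46.0, 56.0, 59.0]
reg_coeffs = fit_least_squares(history_x, history_c)
reg_forecast = predict(reg_coeffs, (16, 24))

# (2) AR(1): regress each value of the series on the previous value.
# The made-up series follows C_t = 10 + 0.8 * C_{t-1}.
series = [20.0, 26.0, 30.8, 34.64, 37.712]
ar_coeffs = fit_least_squares([(c,) for c in series[:-1]], series[1:])
ar_forecast = predict(ar_coeffs, (series[-1],))

print("regression coefficients:", reg_coeffs)
print("regression forecast:", reg_forecast)
print("AR(1) coefficients:", ar_coeffs)
print("AR(1) forecast:", ar_forecast)
```

In this sketch the model is simply refitted on the full history whenever new observations arrive, which is one straightforward way to keep it evolving; real data will be noisy, so the fitted coefficients would only approximate the underlying relationship.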