This could be more of a theoretical question than a code-related one. In my current job I find myself estimating/predicting (the latter is more opportunistic) the water level for a given river in Africa.
The point is that I am developing a simplistic multiple regression model that takes more than 15 years of historical water levels and precipitation (from different locations) to generate water level estimates.
I am not that used to working with Machine Learning, or whatever the correct name is. I am more used to modelling data and generating fits (the current data can be perfectly described with asymmetric Gaussian and sigmoid functions combined with low-order polynomials).
So the point is: once I have a multiple regression model, my colleagues advised me not to use fitted data for the estimation but all the raw data instead. Since they couldn't explain the reason to me, I attempted to use the fitted data as raw inputs (in my defense, a median of all the fitting models has a very low deviation error == nice fits). What I don't understand is why I should use just the raw data, which could be noisy, inaccurate, and influenced by factors that are not directly related (biasing the regression?). What is the advantage of that?
My lack of theoretical knowledge in the field is what makes me wonder about that. Should I always use all the raw data to determine the variables of my multiple regression or can I use the fitted values (i.e. get a median of the different fitting models of each historical year)?
Thanks a lot!
Here are my 2 cents.
I think your colleagues are saying that because it would be better for the model to learn the correlations between the raw data and the actual water level.
In the field you will start with the raw data, so being able to predict directly from it is very useful. Any processing you do on top of the raw data is work you will have to repeat every time you want to make a prediction.
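To make that concrete, here is a minimal sketch of what "repeating the work" looks like. Everything in it is an assumption for illustration: `preprocess` is a hypothetical trailing-mean smoother standing in for your curve fitting, the model is a single-variable linear fit, and the numbers are made up.

```python
import numpy as np

def preprocess(raw, window=3):
    """Hypothetical smoothing step: trailing mean over the last `window` readings."""
    raw = np.asarray(raw, dtype=float)
    out = np.empty_like(raw)
    for i in range(len(raw)):
        out[i] = raw[max(0, i - window + 1): i + 1].mean()
    return out

def train(raw_precip, water_level):
    """Fit water_level = slope * smoothed_precip + intercept."""
    x = preprocess(raw_precip)
    slope, intercept = np.polyfit(x, water_level, deg=1)
    return slope, intercept

def predict(model, new_raw_precip):
    """New readings must go through the same preprocessing the model was trained on."""
    slope, intercept = model
    x = preprocess(new_raw_precip)
    return slope * x + intercept

# Made-up numbers, only to show the call pattern.
raw_precip = [0, 2, 5, 9, 4, 1, 0, 3]
water_level = [1.0, 1.2, 1.9, 3.0, 3.1, 2.2, 1.5, 1.6]
model = train(raw_precip, water_level)
print(predict(model, [1, 2, 3]))
```

If the model were trained on raw readings instead, `predict` would not need the `preprocess` call at all, which is the convenience I mean.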
However, if a simpler model built from asymmetric Gaussian and sigmoid functions combined with low-order polynomials works well, then I would recommend doing that, as long as your (y_pred - y_true) ** 2 is very small.
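One way to check that is to compare the squared error of both approaches on data the model has not seen. The sketch below is only an illustration under stated assumptions: the data is synthetic, `smooth` is a plain moving average standing in for your fitted curves, and the regression is ordinary least squares via NumPy.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def mse(y_pred, y_true):
    """Mean of (y_pred - y_true) ** 2."""
    return float(np.mean((y_pred - y_true) ** 2))

def smooth(X, window=5):
    """Moving average down each column; a stand-in for the fitted curves."""
    kernel = np.ones(window) / window
    return np.column_stack([np.convolve(col, kernel, mode="same") for col in X.T])

# Synthetic data, purely illustrative: a repeating seasonal bump plus noise.
rng = np.random.default_rng(0)
n, split = 400, 300
t = np.arange(n)
signal = np.exp(-0.5 * ((t % 100 - 50) / 12.0) ** 2)
X_raw = np.column_stack([signal + rng.normal(0, 0.2, n) for _ in range(3)])
y = 2.0 * signal + 0.5 + rng.normal(0, 0.1, n)

X_train, X_test = X_raw[:split], X_raw[split:]
y_train, y_test = y[:split], y[split:]

# Fit on the first part, score on the held-out part, for each kind of input.
results = {}
for name, transform in {"raw": lambda X: X, "smoothed": smooth}.items():
    coef = fit_linear(transform(X_train), y_train)
    results[name] = mse(predict_linear(coef, transform(X_test)), y_test)

print(results)
```

Whichever input gives the lower held-out error on your real data is the one I would go with; the synthetic numbers here say nothing about your river.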