I have two large datasets in R, one of actual measurements and one of the predictions I made for these measurements. I found that the trends of my predictions were accurate, but the amplitude was off. I am wondering if there is a way in R to find a constant such that, when the predictions are multiplied by the constant, the error between the actuals and the predictions is minimized.
For example:
predictions <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
actuals <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
The constant I would want to generate in this case would be 2.
I have looked into using the optim() function, but get the warning message that "one-dimensional optimization by Nelder-Mead is unreliable: use 'Brent' or optimize() directly."
f <- function(p) cor(p * predictions, actuals)
optim(
  c(1),
  f,
  control = list(fnscale = -1)
)
I am not familiar with optimization, so it is probable that I am approaching this problem the wrong way. I appreciate the help!
First let's define an error function to minimize:
MultError <- function(constant, predictions, actuals) {
  return(sum((constant * predictions - actuals)^2))
}
This is the sum of squared errors... you could use a different one!
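If you wanted something less sensitive to outliers, for example, a sum of absolute errors would plug into optimize() in exactly the same way; MultErrorAbs below is just an illustrative alternative, not part of the original question.
# Hypothetical alternative: sum of absolute errors, less sensitive to outliers
MultErrorAbs <- function(constant, predictions, actuals) {
  return(sum(abs(constant * predictions - actuals)))
}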
optimize() expects a function, an interval to search in (which you could get by inspecting the min and max of actuals / predictions), and any extra parameters. It will minimize by default.
optimize(MultError, interval=c(0, 5), predictions=predictions, actuals=actuals)
This returns
$minimum
[1] 2
$objective
[1] 0
These are the minimizing constant and the value of the error function at that constant, respectively.
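As an aside (not something the OP asked for), with this particular error function you don't strictly need a numerical search: setting the derivative of the sum of squared errors to zero gives a closed-form answer, which is also what a regression through the origin computes.
# Closed-form least-squares scale; should match the optimize() result above
sum(predictions * actuals) / sum(predictions^2)
# [1] 2
# Equivalently, a no-intercept regression of actuals on predictions
coef(lm(actuals ~ predictions + 0))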
Presumably, your match is not perfect, so I also tried it with artificial noise
set.seed(1)
actuals <- rnorm(length(predictions), 2, 0.4) * predictions
Then it returns
$minimum
[1] 2.087324
$objective
[1] 22.21434
Pretty good!
EDIT:
I answered this question using optimize() because of the title and the direction the OP had gone, but on thinking about it harder, it seemed like it might be overkill. What's wrong with simply taking mean(actuals / predictions)?
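On the noiseless example from the question, that one-liner already recovers the constant exactly:
predictions <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
actuals <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
mean(actuals / predictions)
# [1] 2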
So I decided to test them both...
set.seed(1)
arithmetic <- opt <- numeric(10000)
for (trial in 1:10000) {
  actuals <- rnorm(length(predictions), 2, 0.4) * predictions
  arithmetic[trial] <- mean(actuals / predictions)
  opt[trial] <- optimize(MultError, interval=c(0, 5), predictions=predictions, actuals=actuals)$minimum
}
For each of 10,000 simulated datasets, we've recovered the constant both by taking the average ratio and by minimizing the sum of squared errors. What are the mean and variance of our estimators?
> mean(arithmetic)
[1] 1.999102
> mean(opt)
[1] 1.998695
Both do pretty well on average.
> var(arithmetic)
[1] 0.0159136
> var(opt)
[1] 0.02724814
The arithmetic mean estimator has a tighter spread, however, so I would argue that you should just take the average!
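If you're curious why (my own gloss, not something the simulation proves): under this noise model every ratio actuals[i] / predictions[i] has the same variance, and the least-squares estimate works out to a weighted mean of those ratios with weights predictions[i]^2, so it over-weights some observations for no benefit. You can check that equivalence on a single simulated dataset:
# The SSE minimizer is a weighted mean of the per-point ratios (weights = predictions^2);
# the two lines below should agree up to optimize()'s numerical tolerance
set.seed(2)
actuals <- rnorm(length(predictions), 2, 0.4) * predictions
weighted.mean(actuals / predictions, predictions^2)
optimize(MultError, interval=c(0, 5), predictions=predictions, actuals=actuals)$minimum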