How to Optimally Place Two Points on a Number Line to Minimize the Total Distance to a Set of Given Points

Given several points on a number line, such as x=1, 2, 5, and 6. I need to place two additional points x1 and x2 on the number line such that the total distance between x=1,2,5,6 and x1 and x2 is minimised. What is an efficient way to search for these points?

e.g. for x=1,2,5,6 and x1 = 1, x2 = 6 would be optimal as 1-1=0,2-1=1,6-5=1,6-6=0. So the total distance would be 2

Solution

If we were to only place one point, we would want to place it at the median location of the x points. The median minimizes the mean absolute error. And of course the total error too. The intuitive explanation is that taking one step left on the number line (-1) the total error is decreased by the number of points to the left and increased by the number of points to the right. The optimum is when we have the same number of points on both sides: the median.

If we can place two points, we will want to place them such that one of them is the median of the first set of points and the other the median of the second set. There are only N choices for splitting the points into two sets, so we can evaluate each option and choose the best one.

import numpy as np
x = np.array([1, 2, 3, 4, 5, 1000, 1001, 1002])
# Sum of the first k elements.
sums = np.insert(np.cumsum(x), 0, 0)
N = len(x)
# Split is the index of the last element in the first half.
split = np.arange(N-1)
# The median positions.
i1 = (split + 1)//2
i2 = split + (N - split)//2
# From start to before i1.
e1 = x[i1]*i1 - sums[i1]
# From after i2 to the split.
e2 = sums[split+1] - sums[i1] - x[i1]*(split - i1 + 1)
# From the split to before i2.
e3 = x[i2]*(i2 - split - 1) - sums[i2] + sums[split+1]
# From after i2 to the end.
e4 = sums[-1] - sums[i2 + 1] - x[i2]*(N - i2 - 1)
error = e1 + e2 + e3 + e4
best = np.argmin(error)
print(f"best x1: {x[i1[best]]}, x2: {x[i2[best]]}, error: {error[best]}")
# best x1: 3, x2: 1001, error: 8

Two concerns with the above code:

It's hard to write this without off-by-one errors. I can only hope it's right.
The reasoning is backwards here. We calculate the error as if everything up to the split were closer to x1 than to x2. This is not generally the case. But I think 1) it's true for the correct solution, and 2) it results in overestimating the error for the incorrect solutions. So it works out in the end.