Problem definition
I'm attempting to reconstruct a non-convex and non-self-intersecting polygonal line (which I call the target line) from a noisy dataset (see the figure below). Therefore, my objective is to learn the positions of the vertices of the target line from the dataset. I refer to the estimated polygonal line as the polygonal line.
The target line is assumed to be normalized, i.e., contained in the intervals [-1,1] (horizontal direction) and [0,1] (vertical direction). Moreover, the target line starts from the point [1 0]'
and ends at a point crossing the horizontal axis of the reference system.
The main challenges of this problem are the following:
The dataset is unordered, meaning that the only information available is the position of the noisy observations. The order refers to the time sequence in which we would observe the observations if we traveled along the target line from its starting point to its ending point. It's important to note that if the order is given, the problem can be easily solved with standard curve-fitting solvers (e.g., by least squares minimization). However, this is not the case for me...
The algorithm must be fast even for large datasets (i.e., the computational time must be on the order of seconds even when the dataset contains 1e4
or even more observations) because I need to implement it in a much larger system that runs in real-time. Thus, I cannot consider fancy algorithms based on optimization problems.
The algorithm must be robust when the dataset is small (e.g., if the number of observations is on the order of 10 or fewer) or when the standard deviation of the Gaussian noise is large (e.g., larger than 1
in the current normalized units). By robust, I mean that in these challenging scenarios, the estimated shape should be something simple and meaningful, such as a half-ellipse, rather than a randomly intertwined line. In other words, the algorithm must err on the side of caution: if there is limited information, the response should be simple and vague.
The number of vertices of the estimated polygonal line has to be equal to a prefixed value N
I'm able to obtain a rough estimate with an algorithm that is quite fast (even though it's not optimized), so I'm seeking suggestions to enhance it regarding some critical issues. I use the following conventions:
hatV
is a 2xN matrix containing the coordinates of the N
vertices
forming the polygonal line;D
is a 2xm matrix containing the coordinates of the m
observations forming the dataset. Such points are generated by uniformly picking a point on the target line and then by adding Gaussian noise.My algorithm takes as input D
, N
, and returns hatV
.
My algorithm (overview)
function hatV = staticShaper(D, N)
% step 0: unfrozen initialization
hatV = 1.5 * [cos(linspace(0, pi, N)); sin(linspace(0, pi, N))];
freeze = zeros(1, N);
for iter = 1 : 3
% step 1: project the polygonal line over the dataset
idx = projectLine(hatV, D);
% step 2: update the polygonal line
hatV = updateLine(hatV, D, idx, freeze);
% step 3: simplify the polygonal line
[hatV, Ndeficit] = simplifyLine(hatV);
% step 4: interpolate and freeze the polygonal line
[hatV, freeze] = interpolateLine(hatV, Ndeficit);
end
end
D
) is fully contained within the intervals [-1, 1] (horizontal direction) and [0, 1] (vertical direction). Accordingly, I initialize the polygonal line hatV
by placing the N
vertices along a semicircle of radius 1.5, ensuring that hatV
encompasses the target line. Additionally, I utilize a vector freeze
that contains a flag for each vertex. If freeze(i) == 1
, then the i-th vertex is frozen, meaning its position cannot be changed. If freeze(i) == 0
, then the i-th vertex is not frozen, allowing its position to be changed.idx(i)
indicates that the nearest observation to vertex i is the idx(i)
-th observation within the dataset D
.N
to N - Ndeficit
.N
vertices, I insert Ndeficit
new vertices into the simplified polygonal line. The newly inserted vertices represent a degree of freedom, while the old vertices are considered fixed. Therefore, I freeze the old vertices and designate the newly inserted vertices as unfrozen, allowing their positions to be updated in the next cycle of the algorithm.Questions
This picture depicts the four different steps of my algorithm over three iterations, considering N = 50
. The dots represent the unfrozen vertices, while the circled dots indicate the frozen vertices. The black lines connect the unfrozen vertices with their nearest observations in the dataset.
My main concern is about how to improve the accuracy of the result. There are two main issues with my solution.
So my first question is: how do I solve these two issues? In addition to this question, I have the following one: how do I prove (or disprove) that my algorithm generates a polygonal line that does not intersect with itself?
My algorithm (more details)
function idx = projectLine(hatV, D)
idx = NaN(1, size(hatV, 2));
for i = 1 : size(hatV, 2)
[~, idx(i)] = min(vecnorm(hatV(:, i) - D));
end
end
For each vertex hatV(:, i)
I search for the nearest observation in D
.
function hatV = updateLine(hatV, D, idx, freeze)
for i = 1 : size(hatV, 2)
if not(freeze(i))
hatV(:, i) = D(:, idx(i));
end
end
hatV(:, 1) = [1 0]';
hatV(2, end) = 0;
end
I place each unfrozen vertex hatV(:, i)
over its nearest observation D(:, idx(i))
. After that, I ensure that the first vertex hatV(:, 1)
is at [1 0]'
, and I confirm that the final vertex hatV(:, end)
has no vertical component. This is because I know that the starting point of the target line is [1 0]'
and that the target line ends on the horizontal axis of the reference system.
function [hatV, Ndeficit] = simplifyLine(hatV)
oldHatV = hatV; % backup
k = 1;
hatV = [];
hatV(:, 1) = oldHatV(:, 1);
for i = 1 : size(oldHatV, 2)
if norm(oldHatV(:, i) - hatV(:, k)) < 0.01
hatV(:, k) = 0.5 * (hatV(:, k) + oldHatV(:, i));
else
k = k + 1;
hatV(:, k) = oldHatV(:, i);
end
end
Ndeficit = size(oldHatV, 2) - size(hatV, 2);
end
I have to say that this apparently simple task takes me some time, and I'm not even sure if this is a good implementation. However, here I aim to eliminate identical or similar vertices. To achieve this, I examine the distances between vertices. If the distance is small enough, I perform a merging operation by taking the mean value. In the end, I count how many vertices are lost during these operations.
function [hatV, freeze] = interpolateLine(hatV, Ndeficit)
% step 1: normalized edge lengths computation
ell = NaN(1, size(hatV, 2) - 1);
for i = 1 : size(hatV, 2) - 1
ell(i) = norm(hatV(:, i + 1) - hatV(:, i));
end
ell = ell / sum(ell);
% step 2: new vertices generation
newVertContainer = cell(1, size(hatV, 2) - 1);
for i = 1 : size(hatV, 2) - 1
nrNewVertices = round(Ndeficit * ell(i));
alpha = 0;
newVertContainer{i} = [];
for j = 1 : nrNewVertices
newVertex = (1 - alpha) * hatV(:, i) + alpha * hatV(:, i + 1);
newVertContainer{i} = cat(2, newVertContainer{i}, newVertex);
alpha = alpha + 1 / nrNewVertices;
end
end
% step 3: output definition
oldHatV = hatV;
hatV = oldHatV(:, 1);
freeze = 1;
for i = 1 : size(oldHatV, 2) - 1
hatV = cat(2, hatV, newVertContainer{i}, oldHatV(:, i + 1) );
freeze = cat(2, freeze, zeros(1, size(newVertContainer{i}, 2)), 1);
end
end
The idea is to insert new vertices 'uniformly' along the current polygonal line. To achieve this, I add a number of new vertices to each edge of the polygonal line that is proportional to the edge length.
The first step is to compute the length of each edge of the estimated polygonal line and normalize the results. If edge i has a normalized length ell(i)
, then the number of new vertices generated between the two old vertices defining the edge is round(Ndeficit * ell(i))
.
Clearly, this strategy does not guarantee that the polygonal line will have exactly N
vertices in the end, but this issue is not a big problem at the moment (I will fix it later).
In step 2, I generate the new vertices, which are temporarily stored in newVertContainer
. Finally, in step 3, I define the updated polygonal line where the old vertices are frozen, and the new vertices are unfrozen.
k = boundary(D(1,:).',D(2,:).',1);
bound = [D(1,k).',D(2,k).'];
(The third input argument in this call of MATLAB's boundary function is an optional argument that is used to "tighten" the boundary around the dataset if it is close to 1)
The code runs in around 20ms on my machine and is pretty good at giving you both the upper and lower boundaries of the dataset:
scatter(D(1,:).',D(2,:).');hold on
plot(bound(:,1),bound(:,2),'--k','linewidth',2)
% Start from the point closest to [1,0]
[~,idx]=min(sqrt((bound(:,1)-1).^2+bound(:,2).^2));
if idx>1
bound = [bound(idx:end,:);bound(1:(idx-1),:)];
end
% Get the index of the next intersection with the horizontal axis
% (exclude a few points before and after the initial point)
npts_exc = 6;
[~,idx2]=min(bound((npts_exc):(end-npts_exc),2));
bound1 = bound(1:(npts_exc+idx2-1),:);
bound2 = bound((npts_exc+idx2-1):end,:);
% Follow curve 1. for each point on curve 1 get the closest one on
% curve 2 and then construct the mid point
bound_avg = zeros(size(bound1));
for ii=1:size(bound1,1)
[~,idx_min]=min(sqrt((bound2(:,1)-bound1(ii,1)).^2+(bound2(:,2)-bound1(ii,2)).^2));
bound_avg(ii,:) = 0.5*(bound1(ii,:)+bound2(idx_min,:));
end
This still runs in around 30ms total. It gives the following results:
scatter(D(1,:).',D(2,:).');hold on
plot(bound1(:,1),bound1(:,2),'--k','linewidth',1)
plot(bound2(:,1),bound2(:,2),'--r','linewidth',1)
plot(bound_avg(:,1),bound_avg(:,2),'-m','linewidth',2)
The last thing to do is to resample the curve to get the right length of N
points.
% Resampling so that the N points on the curve are uniformally positionned
% on the curve length
N = 50;
bound_res = zeros(N,2);
% Keep endpoints
bound_res(1,:) = bound_avg(1,:);
bound_res(end,:) = bound_avg(end,:);
% Cumulative curve length
c_len = cumsum(sqrt((bound_avg(1:(end-1),1)-bound_avg(2:end,1)).^2+(bound_avg(1:(end-1),2)-bound_avg(2:end,2)).^2));
% Curve length will have to be unique for interp1 to work
[C,ia,~] = unique(c_len);
% remove non unique points from bound_avg as well
bound_avg_un = bound_avg(ia+1,:);
% Target cumulative curve length for uniform distribution
step_size = c_len(end)/(N-1);
target_lengths = step_size*(1:(N-1));
% Interpolate on the target lengths
bound_res(2:(end-1),1) = interp1(C,bound_avg_un(:,1),target_lengths(1:(end-1)));
bound_res(2:(end-1),2) = interp1(C,bound_avg_un(:,2),target_lengths(1:(end-1)));