I'm trying to detect linear trends in data, see the example plot:
The data may be horizontal or increasing/decreasing.
I've tried kmeans clustering and dbscan with reasonable results but cannot help feeling there is probably a better algorithm out there. One of the drawbacks of using clustering is that the number of clusters has to be pre-defined, and that requires prior knowledge of the data. That will not be the case in this application as it is intended to be automated and new data trends can appear at random.
Does any one have any other suggestions to try?
You can use kmeans with some fit criteria to determine the number of clusters.
Here is an example just using kmeans and a known number of clusters based on how the data is generated:
rng(1); % Reset the random number generator
N = 5; % Number of noisy lines to generate
y = rand( 1, N ) * 5; % Initial line y coordinates
n = randi([5,100], 1, N ); % Number of points in each line
y = repelem( y, 1, n ); % Generate a full y values array
y = y + rand(size(y))*0.1; % Add noise
y = y( randsample(numel(y), numel(y)) ); % shuffle
x = linspace( 0, 1, numel(y) ); % Generate x coords
% Cluster based on the y coordinates only
k = kmeans( y(:), N );
% Visualise the result, plot a horizontal line for each cluster
figure(99); clf;
for ii = 1:N
yb = mean( y(k==ii) );
line( [min(x), max(x)], [yb, yb], 'color', 'k', 'linestyle', '-' );
end
line( x, y, 'marker', 'o', 'linestyle', 'none' );
ylim( [min(y), max(y)] + [-0.5, 0.5] );
This could be extended to increase the number of clusters until some goodness-of-fit criteria is met
% Cluster based on the y coordinates only
NMax = 10; % Max number of clusters
thresh = 0.25; % Max error
for n = 1:NMax
k = kmeans( y(:), n );
err = arrayfun( @(nn) norm(y(k==nn) - mean(y(k==nn))), 1:n );
if max(err) < thresh
% This is enough clusters to get a decent fit
break
end
end
This would suggest n
clusters is enough to give an adequate fit, for this example n=0.5
gives 4 lines where n=0.25
gives 5.
You could use kmedoids
and a custom distance function to more heavily weight the y
coordinate instead of completely ignoring the x
coordinate if the lines aren't exclusively horizontal
w = 2; % weighting in favour of close vertical distance
% Define a distance function which more heavily weights the y coord
fdist = @(zi,zj) (zi(1) - zj(:,1)).^2 + w*(zi(2) - zj(:,2)).^2;
k = kmedoids( [x(:),y(:)], n, 'distance', fdist );