Search code examples
matlabcluster-analysisdbscantrend

Best algorithm to detect multiple linear trends in data


I'm trying to detect linear trends in data, see the example plot:

example plot

The data may be horizontal or increasing/decreasing.

I've tried kmeans clustering and dbscan with reasonable results but cannot help feeling there is probably a better algorithm out there. One of the drawbacks of using clustering is that the number of clusters has to be pre-defined, and that requires prior knowledge of the data. That will not be the case in this application as it is intended to be automated and new data trends can appear at random.

Does any one have any other suggestions to try?


Solution

  • You can use kmeans with some fit criteria to determine the number of clusters.

    Here is an example just using kmeans and a known number of clusters based on how the data is generated:

    rng(1); % Reset the random number generator
    N = 5;  % Number of noisy lines to generate
    
    y = rand( 1, N ) * 5;      % Initial line y coordinates
    n = randi([5,100], 1, N ); % Number of points in each line
    y = repelem( y, 1, n );    % Generate a full y values array
    y = y + rand(size(y))*0.1;               % Add noise
    y = y( randsample(numel(y), numel(y)) ); % shuffle
    x = linspace( 0, 1, numel(y) );          % Generate x coords
    
    % Cluster based on the y coordinates only
    k = kmeans( y(:), N );
    
    % Visualise the result, plot a horizontal line for each cluster
    figure(99); clf;
    for ii = 1:N
        yb = mean( y(k==ii) );
        line( [min(x), max(x)], [yb, yb], 'color', 'k', 'linestyle', '-' );
    end
    line( x, y, 'marker', 'o', 'linestyle', 'none' );
    ylim( [min(y), max(y)] + [-0.5, 0.5] );
    

    This could be extended to increase the number of clusters until some goodness-of-fit criteria is met

    % Cluster based on the y coordinates only
    NMax = 10; % Max number of clusters
    thresh = 0.25; % Max error
    for n = 1:NMax
        k = kmeans( y(:), n );
        err = arrayfun( @(nn) norm(y(k==nn) - mean(y(k==nn))), 1:n );
        if max(err) < thresh
            % This is enough clusters to get a decent fit
            break
        end
    end
    

    This would suggest n clusters is enough to give an adequate fit, for this example n=0.5 gives 4 lines where n=0.25 gives 5.

    plot

    You could use kmedoids and a custom distance function to more heavily weight the y coordinate instead of completely ignoring the x coordinate if the lines aren't exclusively horizontal

    w = 2; % weighting in favour of close vertical distance
    % Define a distance function which more heavily weights the y coord
    fdist = @(zi,zj) (zi(1) - zj(:,1)).^2 + w*(zi(2) - zj(:,2)).^2;
    k = kmedoids( [x(:),y(:)], n, 'distance', fdist );