matlab image-processing computer-vision computational-geometry robotics

detecting lump region in a 2D array

in the 2D array plotted below, we are interested in finding the "lump" region. As you can see it is not a continuous graph. Also, we know the approximate dimension of the "lump" region. A set of data are given below. First column contains the y values and the second contains the x values. Any suggestion as to how to detect lump regions like this?

enter image description here

   21048        -980
   21044        -956
   21040        -928
   21036        -904
   21028        -880
   21016        -856
   21016        -832
   21016        -808
   21004        -784
   21004        -760
   20996        -736
   20996        -712
   20992        -684
   20984        -660
   20980        -636
   20968        -612
   20968        -588
   20964        -564
   20956        -540
   20956        -516
   20952        -492
   20948        -468
   20940        -440
   20936        -416
   20932        -392
   20928        -368
   20924        -344
   20920        -320
   20912        -296
   20912        -272
   20908        -248
   20904        -224
   20900        -200
   20900        -176
   20896        -152
   20888        -128
   20888        -104
   20884         -80
   20872         -52
   20864         -28
   20856          -4
   20836          16
   20812          40
   20780          64
   20748          88
   20744         112
   20736         136
   20736         160
   20732         184
   20724         208
   20724         232
   20724         256
   20720         280
   20720         304
   20720         328
   20724         352
   20724         376
   20732         400
   20732         424
   20736         448
   20736         472
   20740         496
   20740         520
   20748         544
   20740         568
   20736         592
   20736         616
   20736         640
   20740         664
   20740         688
   20736         712
   20736         736
   20744         760
   20748         788
   20760         812
   20796         836
   20836         860
   20852         888
   20852         912
   20844         936
   20836         960
   20828         984
   20820        1008
   20816        1032
   20820        1056
   20852        1080
   20900        1108
   20936        1132
   20956        1156
   20968        1184
   20980        1208
   20996        1232
   21004        1256
   21012        1280
   21016        1308
   21024        1332
   21024        1356
   21028        1380
   21024        1404
   21020        1428
   21016        1452
   21008        1476
   21004        1500
   20992        1524
   20980        1548
   20956        1572
   20944        1596
   20920        1616
   20896        1640
   20872        1664
   20848        1684
   20812        1708
   20752        1728
   20664        1744
   20640        1768
   20628        1792
   20628        1816
   20620        1836
   20616        1860
   20612        1884
   20604        1908
   20596        1932
   20588        1956
   20584        1980
   20580        2004
   20572        2024
   20564        2048
   20552        2072
   20548        2096
   20536        2120
   20536        2144
   20524        2164
   20516        2188
   20512        2212
   20508        2236
   20500        2260
   20488        2280
   20476        2304
   20472        2328
   20476        2352
   20460        2376
   20456        2396
   20452        2420
   20452        2444
   20436        2468
   20432        2492
   20432        2516
   20424        2536
   20420        2560
   20408        2584
   20396        2608
   20388        2628
   20380        2652
   20364        2676
   20364        2700
   20360        2724
   20352        2744
   20344        2768
   20336        2792
   20332        2812
   20328        2836
   20332        2860
   20340        2888
   20356        2912
   20380        2940
   20428        2968
   20452        2996
   20496        3024
   20532        3052
   20568        3080
   20628        3112
   20652        3140
   20728        3172
   20772        3200
   20868        3260
   20864        3284
   20864        3308
   20868        3332
   20860        3356
   20884        3384
   20884        3408
   20912        3436
   20944        3464
   20948        3488
   20948        3512
   20932        3536
   20940        3564

Solution

It may be just a coincidence, but the lump you show looks fairly parabolic. It's not completely clear what you mean by "know the approximate dimension of the lump region" but if you mean that you know approximately how wide it is (i.e. how much of the x-axis it takes up), you could simply slide a window of that width along the x-axis and do a parabolic fit (a.k.a. polyfit with degree 2) to all data that fits into the window at each point. Then, compute r^2 goodness-of-fit values at each point and the point with the r^2 closest to 1.0 would be the best fit. You'd probably need a threshold value and to throw out those where the x^2 coefficient was positive (to find lumps rather than dips) for sanity, but this might be a workable approach.

Even if the parabolic look is a coincidence, I think this would ba a reasonable approach--a downward pointing parabola is a pretty good description of a general "lump" by any definition I can think of.

Edit: Attempted Implementation Below

I got curious and went ahead and implemented my proposed solution (with slight modifications). First, here's the code (ugly but functional):

function [x, p] = find_lump(data, width)

  n = size(data, 1);

  f = plot(data(:,1),data(:,2), 'bx-');
  hold on;

  bestX = -inf;
  bestP = [];
  bestMSE = inf;
  bestXdat = [];
  bestYfit = [];

  spanStart = 0;
  spanStop = 1;
  spanWidth = 0;

  while (spanStop < n)
    if (spanStart > 0)
      % Drop first segment from window (since we'll advance x):
      spanWidth = spanWidth - (data(spanStart + 1, 1) - x);
    end

    spanStart = spanStart + 1;
    x = data(spanStart, 1);

    % Advance spanStop index to maintain window width:
    while ((spanStop < n) && (spanWidth <= width))
      spanStop = spanStop + 1;
      spanWidth = data(spanStop, 1) - x;
    end

    % Correct for overshoot:
    if (spanWidth > width) 
      spanStop = spanStop - 1;
      spanWidth = data(spanStop, 1) - x;
    end

    % Fit parabola to data in the current window:
    xdat = data(spanStart:spanStop, 1);
    ydat = data(spanStart:spanStop, 2);
    p = polyfit(xdat, ydat, 2);

    % Compute fit quality (mean squared error):
    yfit = polyval(p,xdat);
    r = yfit - ydat;
    mse = (r' * r) / size(xdat,1);

    if ((p(1) < -0.002) && (mse < bestMSE))
      bestMSE = mse;
      bestX = x;
      bestP = p;
      bestXdat = xdat;
      bestYfit = yfit;
    end
  end

  x = bestX;
  p = bestP;

  plot(bestXdat,bestYfit,'r-');

...and here's a result using the given data (I swapped the columns so column 1 is x values and column 2 is y values) with a window width parameter of 750:

Data plot with lump fit drawn in

Comments:

I opted to use mean squared error between the fit parabola and the original data within each window as the quality metric, rather than correlation coefficient (r^2 value) due to laziness more than anything else. I don't think the results would be much different the other way.

The output is heavily dependent on the threshold value chosen for the quadratic coefficient (see the bestMSE condition at the end of the loop). Truth be told, I cheated here by outputing the fit coefficients at each point, then selected the threshold based on the known lump shape. This is equivalent to using a lump template as suggested by @chaohuang and may not be very robust depending on the expected variance in the data.

Note that some sort of shape control parameter seems to be necessary if this approach is used. The reason is that any random (smooth) run of data can be fit nicely to some parabola, but not necessarily around the maximum value. Here's a result where I set the threshold to zero and thus only restricted the fit to parabolas pointing downwards:

Best fit is not a "lump"

An improvement would be to add a check that the fit parabola at least has a maximum within the window interval (that is, check that the first derivative goes to zero within the window so we at least find local maxima along the curve). This alone is not sufficient as you still might have a tiny little lump that fits a parabola better than an "obvious" big lump as seen in the given data set.