Gnuplot - How to ignore outliers for the fit?


I have started working with Gnuplot and tried out a few things. Now I am wondering how to automatically exclude outliers from a fit. The figure shows an example: the data point at x = 4, y = 50 in the second data set distorts the fit.

[Figure: Outlier in "data set 2" distorts the fit]

The data set is included in the gnuplot code below.

I found a similar question here, but I could not make it work for my example. There are probably many different approaches, and I am not very experienced with Gnuplot or similar software, so I would be glad for suggestions on a possible way to identify outliers.

I'm using the gnuplottex package in LaTeX (texlive) on Windows 10. The gnuplot code:

\begin{gnuplot}[terminal=tikz, terminaloptions={color size 7cm,5cm}]
reset session

$Data <<EOD
#data
x   y1  y2  y3  y4
1   1   6   4   2   
2   4   10  1   1   
3   9   15  0   0.5 
4   16  50  1   2   
5   25  31  4   5   
6   36  42  9   12  
7   49  55  30  23
EOD

datafile = 'data.dat'
set print 'parameters.dat'

#_____________Set the label for data points________________________
set key top left                            # set position of legend
set key Left                                # set raggedleft
set key samplen 2 spacing 1.2 font ",8" # set fontsize and spacing
set key noautotitle 

###1__________Define function and number of columns_________________________
f(x,a,b,c) = a*(x-b)**2 + c
colMin = 2
colMax = 5
set fit quiet nolog
array A[colMax]
array B[colMax]
array C[colMax]

do for [col=colMin:colMax] {
    a=1; b=1; c=4            # some initial values, sometimes 0 or NaN is not a good start
    fit f(x,a,b,c) datafile u 1:col via a,b,c
    A[col] = a;  B[col] = b;  C[col] = c
    
    print sprintf ('%d %.4f %.4f %.4f',col-1,A[col],B[col],C[col])
}

plot for [col=colMin:colMax] datafile u 1:col ls col, \
     for [col=colMin:colMax] f(x,A[col],B[col],C[col]) ls col, \
     for [col=colMin:colMax] keyentry w lp ls col \
     title sprintf("$y%d$",col-1)
\end{gnuplot}

Solution

  • As mentioned in the comments, you first have to define what you consider an outlier. There are certainly several ways to do that. I'm not claiming that this is the best way; just consider it a starting point.

    Some comments:

    • fit once using all data points
    • define an absolute distance OutlierDist beyond which a point counts as an outlier
    • plot the data into a table $NOOUTLIERS; if the absolute distance to the fitted curve is >= OutlierDist, write NaN into the second column and the original value into the third column
    • fit a second time (now without the outliers)
    • plot the data, the fitted curves (second fit) and, if desired, the outliers

    This can certainly be optimized.
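
    The outlier criterion from the third step can be tried out in isolation. This is only a minimal sketch: the helper name isOutlier and the parameter values a=1, b=0, c=0 (which reproduce the y1 column, y = x**2) are assumptions made for illustration.

    ```gnuplot
    # sketch only: isOutlier and the parameters 1,0,0 are illustrative assumptions
    f(x,a,b,c) = a*(x-b)**2 + c
    OutlierDist = 10
    # 1 if the point (x,y) lies at least OutlierDist away from the curve, else 0
    isOutlier(x,y,a,b,c) = abs(y - f(x,a,b,c)) >= OutlierDist
    print isOutlier(4, 50, 1, 0, 0)   # distance to the curve: |50-16| = 34
    print isOutlier(4, 16, 1, 0, 0)   # distance to the curve: 0
    ```

    Because the threshold is an absolute distance, a suitable value of OutlierDist depends on the scale of the data and has to be chosen per data set.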

    Data: "SO77774328.dat"

    x   y1  y2  y3  y4
    1    1    6   4    2
    2    4   10   1    1
    3    9   15   0    0.5
    4   16   50   1    2
    5   25   31   4    5
    6   36   42   9   12
    7   49   55  30   23
    

    Script:

    ### remove outliers for fitting
    reset session
    
    FILE     = "SO77774328.dat"
    PARAMS_1 = "SO77774328_1.par"
    PARAMS_2 = "SO77774328_2.par"
    
    f(x,a,b,c) = a*(x-b)**2 + c
    colMin = 2
    colMax = 5
    set fit quiet nolog
    array A[colMax]
    array B[colMax]
    array C[colMax]
    
    set print PARAMS_1
    do for [col=colMin:colMax] {
        a=1; b=1; c=4            # some initial values, sometimes 0 or NaN is not a good start
        fit f(x,a,b,c) FILE u 1:col via a,b,c
        A[col] = a;  B[col] = b;  C[col] = c
        print sprintf ('%d %.4f %.4f %.4f',col-1,A[col],B[col],C[col])
    }
    unset print
    
    # write data to table with outliers --> NaN
    OutlierDist = 10   # outlier distance
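    # NaN if the point is at least OutlierDist away from the first-pass fit, otherwise the original y value
    dev(colX,colY) = abs(column(colY)-f(column(colX),A[colY],B[colY],C[colY])) >= OutlierDist ? NaN :  column(colY)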
    set table $NOOUTLIERS
        do for [colY=colMin:colMax] {
            plot FILE u 1:(v0=dev(1,colY)):(v0!=v0?column(colY):NaN) lc var
        }
    unset table
    
    # fit again
    set print PARAMS_2
    do for [col=colMin:colMax] {
        i = col-colMin   # datablock index
        a=1; b=1; c=4            # some initial values, sometimes 0 or NaN is not a good start
        fit f(x,a,b,c) $NOOUTLIERS index i u 1:2 via a,b,c
        A[col] = a;  B[col] = b;  C[col] = c
        print sprintf ('%d %.4f %.4f %.4f',col-1,A[col],B[col],C[col])
    }
    unset print
    
    set key noautotitle left top
    
    plot for [col=colMin:colMax] FILE u 1:col ls col-1, \
         for [col=colMin:colMax] f(x,A[col],B[col],C[col]) ls col-1, \
         for [col=colMin:colMax] keyentry w lp ls col-1 title sprintf("y%d",col-1), \
         $NOOUTLIERS u 1:(valid(2) ? NaN : column(3)) w p pt 6 ps 2 lc "red" ti "Outlier"
    ### end of script
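
    The second pass relies on fit not using rows whose y value is NaN. A minimal sketch of that behaviour, using a made-up datablock $Demo and a hypothetical one-parameter model g(x), neither of which is part of the script above:

    ```gnuplot
    # sketch only: $Demo and g(x) are made up for illustration
    $Demo <<EOD
    1  2.1
    2  3.9
    3  NaN
    4  8.0
    EOD
    g(x) = m*x
    m = 1                        # initial value
    set fit quiet nolog
    fit g(x) $Demo u 1:2 via m   # the NaN row is not used as a data point
    print m, FIT_NDF             # FIT_NDF = number of used points minus number of fit parameters
    ```

    Comparing FIT_NDF against the number of rows is a quick way to check how many points actually entered a fit.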
    

    Result:

    [Image: resulting plot of the data, the second-pass fitted curves, and the outlier marked with a red circle]