I have just started working with Gnuplot and have tried out a few things. Now I am wondering how to automatically exclude outliers from a fit. An example is shown in the figure: the data point at (4, 50) in the second data set. The data set is included in the code below.
I've found a similar question here, but I couldn't make it work for my example. There may be many different approaches, and I'm not very experienced with Gnuplot or similar software, so I would be glad to get suggestions on a possible way to define and handle outliers.
I'm using the gnuplottex package in LaTeX (texlive) on Windows 10. The gnuplot code:
\begin{gnuplot}[terminal=tikz, terminaloptions={color size 7cm,5cm}]
reset session
$Data <<EOD
#data
x y1 y2 y3 y4
1 1 6 4 2
2 4 10 1 1
3 9 15 0 0.5
4 16 50 1 2
5 25 31 4 5
6 36 42 9 12
7 49 55 30 23
EOD
datafile = 'data.dat'
set print 'parameters.dat'
#_____________Set the label for data points________________________
set key top left # set position of legend
set key Left # set raggedleft
set key samplen 2 spacing 1.2 font ",8" # set fontsize and spacing
set key noautotitle
###1__________Define function and number of columns_________________________
f(x,a,b,c) = a*(x-b)**2 + c
colMin = 2
colMax = 5
set fit quiet nolog
array A[colMax]
array B[colMax]
array C[colMax]
do for [col=colMin:colMax] {
a=1; b=1; c=4 # some initial values, sometimes 0 or NaN is not a good start
fit f(x,a,b,c) datafile u 1:col via a,b,c
A[col] = a; B[col] = b; C[col] = c
print sprintf ('%d %.4f %.4f %.4f',col-1,A[col],B[col],C[col])
}
plot for [col=colMin:colMax] datafile u 1:col ls col, \
for [col=colMin:colMax] f(x,A[col],B[col],C[col]) ls col, \
for [col=colMin:colMax] keyentry w lp ls col \
title sprintf("$y%d$",col-1)
\end{gnuplot}
As mentioned in the comments, you first have to define what you consider an outlier. There are certainly several ways to do that. I'm not claiming that this is the best way; just consider it a starting point.
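One of those several ways, shown here only as a sketch and not as the approach used in this answer, is to derive the distance threshold from the scatter of the first fit instead of fixing it by hand. The example assumes a single data set (y2 from the question) in an inline datablock; the factor `k` and the datablock name `$Flags` are hypothetical. `FIT_STDFIT` is the rms of residuals that gnuplot sets after each fit.

```gnuplot
### sketch: derive the outlier threshold from the scatter of the first fit
reset session
$Data <<EOD
1  6
2 10
3 15
4 50
5 31
6 42
7 55
EOD
f(x,a,b,c) = a*(x-b)**2 + c
set fit quiet nolog
a=1; b=1; c=4
fit f(x,a,b,c) $Data u 1:2 via a,b,c      # first fit, all points included
k = 1.5                                   # hypothetical factor, tune for your data
OutlierDist = k * FIT_STDFIT              # threshold relative to the rms of residuals
print sprintf("rms of residuals: %.3f  threshold: %.3f", FIT_STDFIT, OutlierDist)
# columns: x, y, residual, flag (1 if |residual| >= OutlierDist, else 0)
set table $Flags
    plot $Data u 1:2:(r=column(2)-f(column(1),a,b,c)):(abs(r)>=OutlierDist) w table
unset table
print $Flags
### end of sketch
```

With only a handful of points, a single large outlier inflates the rms of residuals itself, so the factor may have to be fairly small for the outlier to be flagged at all.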
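Some comments:
First, fit all of the data. Define via OutlierDist what you consider an outlier. Then write the data into a datablock $NOOUTLIERS: if the absolute distance to the fitted curve is >= OutlierDist, write NaN into the second column and the original value into the third column. Finally, fit again using the data without the outliers.
This can certainly be optimized.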
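As one possible simplification, the second fit can read the masked values directly from a `using` expression, without an intermediate table. This is a minimal sketch for a single data set (y2 from the question); it assumes that `fit` skips points whose `using` expression evaluates to NaN, and the names `clean`, `a0`, `b0` and `c0` are hypothetical. The first-pass parameters are copied so that the mask does not change while the second fit varies a, b and c.

```gnuplot
### sketch: two-pass fit without an intermediate table (single data set)
reset session
$Data <<EOD
1  6
2 10
3 15
4 50
5 31
6 42
7 55
EOD
f(x,a,b,c) = a*(x-b)**2 + c
set fit quiet nolog
a=1; b=1; c=4
fit f(x,a,b,c) $Data u 1:2 via a,b,c               # first pass, all points
a0=a; b0=b; c0=c                                    # freeze first-pass parameters
OutlierDist = 10                                    # outlier distance
# y value, or NaN if it lies OutlierDist or more away from the first-pass curve
clean(colX,colY) = abs(column(colY)-f(column(colX),a0,b0,c0)) >= OutlierDist ? NaN : column(colY)
a=1; b=1; c=4
fit f(x,a,b,c) $Data u 1:(clean(1,2)) via a,b,c    # second pass, NaN points dropped
print sprintf("pass 1: %.4f %.4f %.4f", a0, b0, c0)
print sprintf("pass 2: %.4f %.4f %.4f", a, b, c)
### end of sketch
```

The table-based version keeps the removed values in a third column, which is what makes it easy to mark the outliers in the final plot; this shorter variant only covers the refit.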
Data: "SO77774328.dat"
x y1 y2 y3 y4
1 1 6 4 2
2 4 10 1 1
3 9 15 0 0.5
4 16 50 1 2
5 25 31 4 5
6 36 42 9 12
7 49 55 30 23
Script:
### remove outliers for fitting
reset session
FILE = "SO77774328.dat"
PARAMS_1 = "SO77774328_1.par"
PARAMS_2 = "SO77774328_2.par"
f(x,a,b,c) = a*(x-b)**2 + c
colMin = 2
colMax = 5
set fit quiet nolog
array A[colMax]
array B[colMax]
array C[colMax]
set print PARAMS_1
do for [col=colMin:colMax] {
a=1; b=1; c=4 # some initial values, sometimes 0 or NaN is not a good start
fit f(x,a,b,c) FILE u 1:col via a,b,c
A[col] = a; B[col] = b; C[col] = c
print sprintf ('%d %.4f %.4f %.4f',col-1,A[col],B[col],C[col])
}
unset print
# write data to table with outliers --> NaN
OutlierDist = 10 # outlier distance
dev(colX,colY) = abs(column(colY)-f(column(colX),A[colY],B[colY],C[colY])) >= OutlierDist ? NaN : column(colY)
set table $NOOUTLIERS
do for [colY=colMin:colMax] {
plot FILE u 1:(v0=dev(1,colY)):(v0!=v0?column(colY):NaN) lc var
}
unset table
# fit again
set print PARAMS_2
do for [col=colMin:colMax] {
i = col-colMin # datablock index
a=1; b=1; c=4 # some initial values, sometimes 0 or NaN is not a good start
fit f(x,a,b,c) $NOOUTLIERS index i u 1:2 via a,b,c
A[col] = a; B[col] = b; C[col] = c
print sprintf ('%d %.4f %.4f %.4f',col-1,A[col],B[col],C[col])
}
unset print
set key noautotitle left top
plot for [col=colMin:colMax] FILE u 1:col ls col-1, \
for [col=colMin:colMax] f(x,A[col],B[col],C[col]) ls col-1, \
for [col=colMin:colMax] keyentry w lp ls col-1 title sprintf("y%d",col-1), \
$NOOUTLIERS u 1:(valid(2) ? NaN : column(3)) w p pt 6 ps 2 lc "red" ti "Outlier"
### end of script
Result: