Outliers between 1,5 - 3 times the interquantile range is marked with an "+" and above 3 times the IQR with an "o". But due to this data set with multiple outliers the below boxplot is very hard to read since the "+" and "o" symbols are plotted on top of each other creating what appears to be a thick red line.
I need to plot all data so removing them is not an option but I would be fine to display "longer" boxes, i.e. stretch the q1 and q4 to reach the true min/max values and skip the "+" and "o" outlier symbols. I would also be fine if just the min and max outliers was displayed.
I'm totally in the dark here and the octave boxplot documentation found here did not include any helpful examples on how to handle outliers. A search here at stackoverflow didn't get me closer to a solution either. So any help or directions is very appreciated!
How can I modify the below code to create a boxplot based on the same data set that is readable (i.e. doesn't plot outliers on top of each other creating a thick red line)?
I'm using Octave 4.2.1 64-bits on a Windows 10 machine with qt as the graphics_toolkit and with GDAL_TRANSLATE called from within Octave to handle the tif-files.
It's not an option to switch graphics_toolkit to gnuplot etc. since I haven't been able to "rotate" the plot (horizontal boxes instead of vertical). And it's in the .pdf file the results must have an effect, not only in octaves viewer.
Please forgive my totally "newbie-style" coding-work-around to get a proper high resolution pdf-exported:
pkg load statistics
clear all;
fns = glob ("*.tif");
for k=1:numel (fns)
ofn = tmpnam;
cmd = sprintf ('gdal_translate -of aaigrid "%s" "%s"', fns{k}, ofn);
[s, out] = system (cmd);
if (s != 0)
error ('calling gdal_translate failed with "%s"', out);
endif
fid = fopen (ofn, "r");
# read 6 headerlines
hdr = [];
for i=1:6
s = strsplit (fgetl (fid), " ");
hdr.(s{1}) = str2double (s{2});
endfor
d = dlmread (fid);
# check size against header
assert (size (d), [hdr.nrows hdr.ncols])
# set nodata to NA
d (d == hdr.NODATA_value) = NA;
raw{k} = d;
# create copy with existing values
raw_v{k} = d(! isna (d));
fclose (fid);
endfor
## generate plot
boxplot (raw_v)
set (gca, "xtick", 1:numel(fns),
"xticklabel", strrep (fns, ".tif", ""));
ylabel ("Plats kvar (meter)");
set (gca, "ytick", 0:50:600);
set (gca, "ygrid", "on");
set (gca, "gridlinestyle", "--");
set (gcf, "paperunit", "centimeters", "papersize", [35, 60], "paperposition", [0 0 60 30], "paperorientation", "landscape")
zoom (0.95)
view ([90 90])
print ("loudden_box_dotted.pdf", "-F:14")
I would just delete the outliers. This is easy because the handles are returned. I've also included some caching algorithm so you don't have to reload all tifs if you are playing with plots. Splitting the conversion, processing and plotting in different scripts is always a good idea (but not for stackoverflow where minimalistic examples are prefered). Here we go:
pkg load statistics
cache_fn = "input.raw";
# only process tif if not already done
if (! exist (cache_fn, "file"))
fns = glob ("*.tif");
for k=1:numel (fns)
ofn = tmpnam;
cmd = sprintf ('gdal_translate -of aaigrid "%s" "%s"', fns{k}, ofn);
printf ("calling '%s'...\n", cmd);
fflush (stdout);
[s, out] = system (cmd);
if (s != 0)
error ('calling gdal_translate failed with "%s"', out);
endif
fid = fopen (ofn, "r");
# read 6 headerlines
hdr = [];
for i=1:6
s = strsplit (fgetl (fid), " ");
hdr.(s{1}) = str2double (s{2});
endfor
d = dlmread (fid);
# check size against header
assert (size (d), [hdr.nrows hdr.ncols])
# set nodata to NA
d (d == hdr.NODATA_value) = NA;
raw{k} = d;
# create copy with existing values
raw_v{k} = d(! isna (d));
fclose (fid);
endfor
# save result
save (cache_fn, "raw_v", "fns");
else
load (cache_fn)
endif
## generate plot
[s, h] = boxplot (raw_v);
## in h you'll find now box, whisker, median, outliers and outliers2
## delete them
delete (h.outliers)
delete (h.outliers2)
set (gca, "xtick", 1:numel(fns),
"xticklabel", strrep (fns, ".tif", ""));
ylabel ("Plats kvar (meter)");
set (gca, "ytick", 0:50:600);
set (gca, "ygrid", "on");
set (gca, "gridlinestyle", "--");
set (gcf, "paperunit", "centimeters", "papersize", [35, 60], "paperposition", [0 0 60 30], "paperorientation", "landscape")
zoom (0.95)
view ([90 90])
print ("loudden_box_dotted.pdf", "-F:14")
gives