Search code examples
octaveboxplotreadabilityoutliers

How to handle outliers in this octave boxplot for improved readability


Outliers between 1,5 - 3 times the interquantile range is marked with an "+" and above 3 times the IQR with an "o". But due to this data set with multiple outliers the below boxplot is very hard to read since the "+" and "o" symbols are plotted on top of each other creating what appears to be a thick red line.

I need to plot all data so removing them is not an option but I would be fine to display "longer" boxes, i.e. stretch the q1 and q4 to reach the true min/max values and skip the "+" and "o" outlier symbols. I would also be fine if just the min and max outliers was displayed.

I'm totally in the dark here and the octave boxplot documentation found here did not include any helpful examples on how to handle outliers. A search here at stackoverflow didn't get me closer to a solution either. So any help or directions is very appreciated!

How can I modify the below code to create a boxplot based on the same data set that is readable (i.e. doesn't plot outliers on top of each other creating a thick red line)?

enter image description here

I'm using Octave 4.2.1 64-bits on a Windows 10 machine with qt as the graphics_toolkit and with GDAL_TRANSLATE called from within Octave to handle the tif-files.

It's not an option to switch graphics_toolkit to gnuplot etc. since I haven't been able to "rotate" the plot (horizontal boxes instead of vertical). And it's in the .pdf file the results must have an effect, not only in octaves viewer.

Please forgive my totally "newbie-style" coding-work-around to get a proper high resolution pdf-exported:

pkg load statistics

clear all;
fns = glob ("*.tif");
for k=1:numel (fns)

  ofn = tmpnam;
  cmd = sprintf ('gdal_translate -of aaigrid "%s" "%s"', fns{k}, ofn);
  [s, out] = system (cmd);
  if (s != 0)
    error ('calling gdal_translate failed with "%s"', out);
  endif
  fid = fopen (ofn, "r");
  # read 6 headerlines
  hdr = [];
  for i=1:6
    s = strsplit (fgetl (fid), " ");
    hdr.(s{1}) = str2double (s{2});
  endfor
  d = dlmread (fid);

  # check size against header
  assert (size (d), [hdr.nrows hdr.ncols])

  # set nodata to NA
  d (d == hdr.NODATA_value) = NA;

  raw{k} = d;

  # create copy with existing values
  raw_v{k} = d(! isna (d));

  fclose (fid);

endfor

## generate plot
boxplot (raw_v)


set (gca, "xtick", 1:numel(fns),
          "xticklabel", strrep (fns, ".tif", ""));
          ylabel ("Plats kvar (meter)");

set (gca, "ytick", 0:50:600);
set (gca, "ygrid", "on");
set (gca, "gridlinestyle", "--");

set (gcf, "paperunit", "centimeters", "papersize", [35, 60], "paperposition", [0 0 60 30], "paperorientation", "landscape")          


zoom (0.95)
view ([90 90])

print ("loudden_box_dotted.pdf", "-F:14")

Solution

  • I would just delete the outliers. This is easy because the handles are returned. I've also included some caching algorithm so you don't have to reload all tifs if you are playing with plots. Splitting the conversion, processing and plotting in different scripts is always a good idea (but not for stackoverflow where minimalistic examples are prefered). Here we go:

    pkg load statistics
    
    cache_fn = "input.raw";
    
    # only process tif if not already done
    if (! exist (cache_fn, "file"))
      fns = glob ("*.tif");
      for k=1:numel (fns)
    
        ofn = tmpnam;
        cmd = sprintf ('gdal_translate -of aaigrid "%s" "%s"', fns{k}, ofn);
        printf ("calling '%s'...\n", cmd);
        fflush (stdout);
        [s, out] = system (cmd);
        if (s != 0)
          error ('calling gdal_translate failed with "%s"', out);
        endif
        fid = fopen (ofn, "r");
        # read 6 headerlines
        hdr = [];
        for i=1:6
          s = strsplit (fgetl (fid), " ");
          hdr.(s{1}) = str2double (s{2});
        endfor
        d = dlmread (fid);
    
        # check size against header
        assert (size (d), [hdr.nrows hdr.ncols])
    
        # set nodata to NA
        d (d == hdr.NODATA_value) = NA;
    
        raw{k} = d;
    
        # create copy with existing values
        raw_v{k} = d(! isna (d));
    
        fclose (fid);
    
      endfor
    
      # save result
      save (cache_fn, "raw_v", "fns");
    else
      load (cache_fn)
    endif
    
    ## generate plot
    [s, h] = boxplot (raw_v);
    
    ## in h you'll find now box, whisker, median, outliers and outliers2
    ## delete them
    delete (h.outliers)
    delete (h.outliers2)
    
    set (gca, "xtick", 1:numel(fns),
              "xticklabel", strrep (fns, ".tif", ""));
              ylabel ("Plats kvar (meter)");
    
    set (gca, "ytick", 0:50:600);
    set (gca, "ygrid", "on");
    set (gca, "gridlinestyle", "--");
    
    set (gcf, "paperunit", "centimeters", "papersize", [35, 60], "paperposition", [0 0 60 30], "paperorientation", "landscape")          
    
    zoom (0.95)
    view ([90 90])
    
    print ("loudden_box_dotted.pdf", "-F:14")
    

    gives

    generated plot