gnuplot, histogram

gnuplot does not summarize all values within a histogram bin


The histogram does not sum up all values within a bin. The boxes are placed in the correct bin, but they are drawn separately on top of each other instead of being combined into one box. This becomes visible when the fill style is transparent.

Some macros were included via "load" from other scripts, but the reduced code below should show the problem. I removed label and title settings so the resulting plot looks slightly different than the attached one.

reset

set key autotitle columnhead

#=======================================================================================
# from other gnuplot scripts included via "load"
fs_reference_age_min    = 20
fs_reference_age_max    = 100

gp_is_in_range(c,a,b)   = ( c > a && c <= b ) ? 1.0 : NaN
fs_is_in_age_range(c)   = gp_is_in_range( c, fs_reference_age_min, fs_reference_age_max )
#=======================================================================================

fs_age          = 'PatientAge'
valid_age       = 'fs_is_in_age_range( column(fs_age) )'
bin(x)          = floor(x/bin_width)*bin_width

x_min           = 0
x_max           = 100
n_bins          = 20
bin_width       = real(x_max - x_min)/n_bins 
group_boxwidth  = 1

set boxwidth group_boxwidth*0.75
set style fill transparent solid 0.3

n = 4
offset = n    

$Data <<EOD
Subgroup    PatientAge
4   40.55
4   48.96
1   34.94
5   51.45
1   54.8
2   10.51
4   42.87
3   71.41
4   62.2
2   54.22
3   65.04
1   49.73
4   31.46
3   75.25
1   56.97
2   14.56
2   10.64
3   60.54
EOD

plot $Data u ( bin( column(fs_age) ) + ( offset - 0.5 ) * group_boxwidth ):( @valid_age ) smooth freq w boxes lc n ti 'NORM_DB' noenhanced

(Image: the resulting histogram plot)


Solution

  • Thank you for providing a copy & paste minimal (non-)working example including data. Having all the information right away makes debugging much easier.

    When you filter your data, you are introducing NaNs. That's what you are doing with

    gp_is_in_range(c,a,b)   = ( c > a && c <= b ) ? 1.0 : NaN
    

    Each NaN introduces a break in your data; a line plot, for example, would be interrupted at those points.

    To visualize this, plot your smooth freq result into a table. You would see the following:

    # Curve 0 of 1, 17 points
    # Curve title: "NORM_DB"
    # x y xlow xhigh type
     33.5  1  33.5  33.5  i
     43.5  1  43.5  43.5  i
     48.5  1  48.5  48.5  i
     53.5  2  53.5  53.5  i
    
     33.5  1  33.5  33.5  i
     43.5  1  43.5  43.5  i
     48.5  1  48.5  48.5  i
     53.5  1  53.5  53.5  i
     58.5  1  58.5  58.5  i
     63.5  1  63.5  63.5  i
     68.5  1  68.5  68.5  i
     73.5  1  73.5  73.5  i
     78.5  1  78.5  78.5  i
    
     63.5  1  63.5  63.5  i
    

    That's the data you put into the smooth freq option.

    And apparently, smooth freq treats different blocks individually. That's why you get 3 histograms or bar charts on top of each other.
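
    The table above can be reproduced with a few extra lines. This is only a sketch: it assumes that $Data, bin(), offset, group_boxwidth, fs_age and valid_age are defined as in the question's script, and $Freq is a hypothetical datablock name chosen for illustration.

    ```gnuplot
    # Redirect the plot output into a datablock instead of drawing it
    set table $Freq
    plot $Data u ( bin( column(fs_age) ) + ( offset - 0.5 ) * group_boxwidth ):( @valid_age ) smooth freq w boxes
    unset table

    # Show the tabulated result; empty lines separate the blocks
    print $Freq
    ```

    Each empty line in the printed output marks a place where one or more filtered-out rows (NaN) split the data.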

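    So, a simple solution (for gnuplot>5.0.6) is to insert the following line before the plot command, so that NaN values are treated as missing data and skipped instead of splitting the data into blocks: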

    set datafile missing NaN
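
For reference, a reduced version of the question's script with the fix applied might look like the sketch below. The helper names age_min, age_max and in_range() are simplifications introduced here for illustration, and the x offset 3.5 is ( offset - 0.5 ) * group_boxwidth from the original script written out.

```gnuplot
reset
set key autotitle columnhead

# Treat NaN as missing data so that filtered-out rows are skipped
# instead of breaking the input into separate blocks (gnuplot > 5.0.6)
set datafile missing NaN

age_min     = 20
age_max     = 100
bin_width   = 5.0
in_range(c) = ( c > age_min && c <= age_max ) ? 1.0 : NaN
bin(x)      = floor(x/bin_width)*bin_width

set boxwidth 0.75
set style fill transparent solid 0.3

$Data <<EOD
Subgroup    PatientAge
4   40.55
4   48.96
1   34.94
5   51.45
1   54.8
2   10.51
4   42.87
3   71.41
4   62.2
2   54.22
3   65.04
1   49.73
4   31.46
3   75.25
1   56.97
2   14.56
2   10.64
3   60.54
EOD

plot $Data u ( bin(column('PatientAge')) + 3.5 ):( in_range(column('PatientAge')) ) smooth freq w boxes lc 4 ti 'NORM_DB' noenhanced
```

Adding up the per-block counts from the table above, a single combined histogram should have heights 2, 2, 2, 3, 1, 2, 1, 1, 1 for the bins at x = 33.5, 43.5, 48.5, 53.5, 58.5, 63.5, 68.5, 73.5 and 78.5, i.e. 15 valid ages in total.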