gnuplot smooth freq does not create a histogram of selected data


I'm trying to create three histograms in one chart. Selecting the appropriate data works, but "smooth freq" doesn't work as expected.

$Data <<EOD
Med     Gender  Age
4       f       33.14
4       f       53.81
4       f       32.99
4       m       39.78
4       f       25.06
2       m       51.06
4       f       39.93
4       f       44.92
2       m       45.68
2       m       73.47
2       m       61.65
4       m       26.82
4       f       24.93
4       f       29.79
3       m       80.54
3       m       81.42
2       f       71.9
2       f       73.18
3       m       64.76
4       m       33.45
2       m       58.92
2       f       73.51
4       f       36.09
EOD

The data set consists of three different groups. The following "functions" are used to select the age values that belong to each group.

GROUP_LABELS = "2 3 4"
GROUP_NAMES = "Med_02 Med_03 Med_04"

is_true(c,x) = ( c == x ) ? 1.0 : NaN
age = "( column(\"Age\") )"
selected_age_values = "is_true( column(\"Med\"), i ) * @age"

x_min = 0
x_max = 100
n_bins = 20
bin_width = 1.*(x_max - x_min)/n_bins

bin(col) = floor(column(col)/bin_width)*bin_width

set boxwidth 0.5
set xtics out
set xrange[x_min:x_max]

plot for [i in GROUP_LABELS] $Data u ( @selected_age_values ):(1) smooth freq w boxes lc i-1 ti word( GROUP_NAMES, i-1 ) noenhanced

Unfortunately, the resulting chart only shows one spike for each data point, which is at least correctly colored.


Solution

  • I tried to simplify your script a bit, but putting three histograms into one plot makes it somewhat complicated again, both to plot and to read.

    Since you have three histograms, each bin (here 5.0 wide) is split into 3 bars. For example, the range from 50 to 55 contains one bar from the first group, none from the second, and one from the third. Note that boxes are plotted centered at their x value, so each group needs an offset that is an odd multiple of half a boxwidth.
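
    The offset can be sketched on its own. The snippet below is only an illustration: center() and the loop index k are names introduced here and are not part of the script. It prints where the box of each group ends up inside the bin that starts at 50. Because the group labels happen to be 2, 3, 4, the factor (i-1.5) in the plot command equals (k-0.5) for k = 1, 2, 3.

    ```gnuplot
    ### sketch: box centers of the three groups within one bin
    GROUP_LABELS = "2 3 4"
    bin_width    = 5.0
    myBoxwidth   = bin_width/words(GROUP_LABELS)

    # center of the k-th box (k = 1, 2, 3) inside the bin starting at x0
    # (hypothetical helper, only for this illustration)
    center(x0,k) = x0 + (k-0.5)*myBoxwidth

    do for [k=1:words(GROUP_LABELS)] {
        print sprintf("group %s: box centered at %.2f", word(GROUP_LABELS,k), center(50,k))
    }
    ### end of sketch
    ```

    With a bin width of 5 the three centers are roughly 50.83, 52.50 and 54.17, so the boxes fill the bin side by side. For labels that are not consecutive numbers, an index-based offset like this would presumably be needed instead of (i-1.5).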

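    The function inGroup() simply returns 1 if the value in the given column equals the group i and 0 otherwise. smooth freq then sums these values for each x value, which gives the count per bin. In your script the raw age is used as x (bin() is defined but never applied), so nearly every x value is unique, which is why you get one spike per data point.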
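
    To see what smooth freq does with these 0/1 values, the generated points can be written to a datablock with set table instead of being plotted. This is a minimal sketch using five rows taken from the data above; $Sample and $Counts are names chosen here for illustration, and it assumes a gnuplot version that accepts a datablock as the target of set table (5.2 or later, as far as I know).

    ```gnuplot
    ### sketch: inspect the sums that smooth freq computes for group 2
    $Sample <<EOD
    Med     Age
    2       73.47
    2       71.9
    4       53.81
    2       51.06
    2       73.18
    EOD

    bin_width      = 5.0
    bin(x)         = floor(x/bin_width)*bin_width
    inGroup(col,i) = column(col) == int(i)

    # write the points generated by smooth freq to $Counts instead of plotting them
    set table $Counts
        plot $Sample u (bin(column("Age"))):(inGroup(1,2)) smooth freq
    unset table
    print $Counts
    ### end of sketch
    ```

    For group 2 the bin starting at 70 should sum to 3 and the bin starting at 50 to 1, since the row with age 53.81 belongs to group 4 and only contributes 0.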

    I hope the rest is self-explanatory.

    There are other ways of representing this, for example one xtic per range (e.g. 50-55) with the 3 bars of that range centered around the tic.

    Script:

    ### three histograms in one plot
    reset session
    
    $Data <<EOD
    Med     Gender  Age
    4       f       33.14
    4       f       53.81
    4       f       32.99
    4       m       39.78
    4       f       25.06
    2       m       51.06
    4       f       39.93
    4       f       44.92
    2       m       45.68
    2       m       73.47
    2       m       61.65
    4       m       26.82
    4       f       24.93
    4       f       29.79
    3       m       80.54
    3       m       81.42
    2       f       71.9
    2       f       73.18
    3       m       64.76
    4       m       33.45
    2       m       58.92
    2       f       73.51
    4       f       36.09
    EOD
    
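    GROUP_LABELS   = "2 3 4"                         # values of column "Med" to select
    GroupName(i)   = sprintf("Med_%02d",int(i))      # key title, e.g. Med_02
    x_min          = 0
    x_max          = 100
    n_bins         = 20
    bin_width      = real(x_max - x_min)/n_bins      # real() avoids integer division
    myBoxwidth     = bin_width/words(GROUP_LABELS)   # one box per group within a bin
    bin(x)         = floor(x/bin_width)*bin_width    # left edge of the bin containing x
    inGroup(col,i) = column(col) == int(i)           # 1 if the row belongs to group i, else 0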
    
    set boxwidth myBoxwidth
    set xlabel "Age"
    set xrange[x_min:x_max]
    set xtics 10 out
    set mxtic 2
    set ylabel "Count"
    set ytics 1
    set grid x, mx, y
    set style fill transparent solid 0.3
    
    plot for [i in GROUP_LABELS] $Data u (bin(column("Age"))+(i-1.5)*myBoxwidth):(inGroup(1,i)) \
         smooth freq w boxes lc i-1  ti GroupName(i) noenhanced
    ### end of script
    

    Result:

    (image: resulting plot with the three histograms)

    Addition: xlabels showing bin ranges

    If you add the following two lines and an additional line to the plot command...

    myXtic(i) = sprintf("%d-%d",i*bin_width,(i+1)*bin_width)
    set xtics right rotate by 60 offset 2,0
    
    plot for [i in GROUP_LABELS] $Data u (bin(column("Age"))+(i-1.5)*myBoxwidth):(inGroup(1,i)) \
         smooth freq w boxes lc i-1  ti GroupName(i) noenhanced, \
         for [i=0:n_bins-1] '+' u (i*bin_width):(NaN):xtic(myXtic(i)) every ::::0 notitle
    

    ... you will get the following result. The 3 bars are not centered around an xtic but sit between the two tics that delimit the age range. The label 50-55 actually means 50 <= age < 55.

    There are certainly many more ways to create such a graph. I guess one should make it as easy as possible for the reader to understand.

    (image: resulting plot with the bin ranges as xtic labels)
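
    The half-open ranges can be checked directly with bin(). This is a small sketch, assuming the same bin width of 5 as in the script; the loop and the variables x and n are only for illustration.

    ```gnuplot
    ### sketch: which bin and which label does a given age get?
    bin_width = 5.0
    bin(x)    = floor(x/bin_width)*bin_width
    myXtic(i) = sprintf("%d-%d",i*bin_width,(i+1)*bin_width)

    do for [age in "50 54.99 55"] {
        x = age + 0                  # loop variable is a string, convert it to a number
        n = int(bin(x)/bin_width)    # index of the bin that contains x
        print sprintf("age %s -> bin starting at %g, label %s", age, bin(x), myXtic(n))
    }
    ### end of sketch
    ```

    Ages 50 and 54.99 both land in the bin labeled 50-55, while 55 already belongs to 55-60.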