Search code examples
awkgnuplot

Mean and Confidence Interval to plot


I have a data file with tab-separated columns which looks like this:

6   27  4   12  20  100 50000   false   0.1 "DFSA"  2   201414  31.47408
3   27  4   12  20  100 50000   false   0.1 "DFSA"  2   204236  31.91436
4   27  4   12  20  100 50000   false   0.1 "DFSA"  2   206964  32.09382
8   27  4   12  20  100 50000   false   0.1 "DFSA"  2   203379  31.88138
1   27  4   12  20  100 50000   false   0.1 "DFSA"  2   207287  32.4096
2   27  4   12  20  100 50000   false   0.1 "DFSA"  2   207115  32.61346
7   27  4   12  20  100 50000   false   0.1 "DFSA"  2   207751  32.77364
5   27  4   12  20  100 50000   false   0.1 "DFSA"  2   209856  32.77856
13  27  4   12  20  100 50000   false   0.1 "DFSA"  2   204257  32.08478
10  27  4   12  20  100 50000   false   0.1 "DFSA"  2   207335  32.22742
9   27  4   12  20  100 50000   false   0.1 "DFSA"  2   200518  31.64624
11  27  4   12  20  100 50000   false   0.1 "DFSA"  2   202415  32.0591
16  27  4   12  20  100 50000   false   0.1 "DFSA"  2   201548  31.59604
12  27  4   12  20  100 50000   false   0.1 "DFSA"  2   213840  33.27478
15  27  4   12  20  100 50000   false   0.1 "DFSA"  2   208926  32.37782
14  27  4   12  20  100 50000   false   0.1 "DFSA"  2   210572  32.81294
17  27  4   12  20  100 50000   false   0.1 "DFSA"  2   204724  32.26324
20  27  4   12  20  100 50000   false   0.1 "DFSA"  2   199169  31.53494
18  27  4   12  20  100 50000   false   0.1 "DFSA"  2   208179  32.73408
19  27  4   12  20  100 50000   false   0.1 "DFSA"  2   204342  31.82608
21  27  4   12  20  100 50000   false   0.1 "DFSA"  2   205068  31.986
24  27  4   12  20  100 50000   false   0.1 "DFSA"  2   207798  32.49448
22  27  4   12  20  100 50000   false   0.1 "DFSA"  2   207499  32.60746
23  27  4   12  20  100 50000   false   0.1 "DFSA"  2   214065  33.17844
25  27  4   12  20  100 50000   false   0.1 "DFSA"  2   198386  31.4488
26  27  4   12  20  100 50000   false   0.1 "DFSA"  2   208307  32.77412
28  27  4   12  20  100 50000   false   0.1 "DFSA"  2   208669  32.77474
27  27  4   12  20  100 50000   false   0.1 "DFSA"  2   214324  33.16506
30  27  4   12  20  100 50000   false   0.1 "DFSA"  2   209924  32.77186
29  27  4   12  20  100 50000   false   0.1 "DFSA"  2   212959  33.3218
32  27  4   12  20  100 50000   false   0.2 "DFSA"  2   486241  64.3479
31  27  4   12  20  100 50000   false   0.2 "DFSA"  2   487487  64.65076
33  27  4   12  20  100 50000   false   0.2 "DFSA"  2   494703  65.06718
34  27  4   12  20  100 50000   false   0.2 "DFSA"  2   488164  64.77828
36  27  4   12  20  100 50000   false   0.2 "DFSA"  2   476513  63.3158
35  27  4   12  20  100 50000   false   0.2 "DFSA"  2   491005  65.02426
38  27  4   12  20  100 50000   false   0.2 "DFSA"  2   487454  64.44962
37  27  4   12  20  100 50000   false   0.2 "DFSA"  2   490494  65.06572
39  27  4   12  20  100 50000   false   0.2 "DFSA"  2   472081  63.31234
40  27  4   12  20  100 50000   false   0.2 "DFSA"  2   498294  65.02114

The data continues down growing the value in column 9 (30 rows with value 0.1, 30 rows with value 0.2, another 30 rows with 0.3 and so on). I would like to plot with yerrorlines in Gnuplot the mean value column 13 for each of the 30 rows with the same value of column 9. To make the plot, I should pass Gnuplot column 9 value, mean of column 13 of the 30 rows, and the min value and the max value of the confidence interval (e.g. 95%). In the case of 0.1, the row to pass to Gnuplot should be like the following:

0.1  36.08  35.83  36.33
0.2  .....  .....  .....
0.3  .....  .....  .....
...  .....  .....  .....

Assuming a 95% confidence interval is +0.25 and +0.25 around the mean of 36.08

Thank you for your help Regards


Solution

  • I know the question has the tag awk, but since gnuplot was mentioned, may I suggest a gnuplot-only solution?

    Gnuplot has a smooth unique function that would look for unique values (in your case column 9) and do a simple average of the y-values (in your case column 13). You can plot this directly, but with a few extra lines you can make your output much better.

    Using the input file from the question, and the following code:

    set term png
    set out "test.png"
    
    set table "tab.tmp"
    plot "data.txt" u 9:13 smooth unique
    unset table
    
    set xrange[0.09:0.21]
    set yrange[25:75]
    
    set style fill transparent solid 0.2 noborder
    
    set key top left
    set multiplot layout 2,1
    
    plot "tab.tmp" u 1:(strcol(3) eq "u" ? 1/0: $2*0.95):($2*1.05) 
         with filledcurves title '95% confidence', \
         "tab.tmp" u 1:(strcol(3) eq "u" ? 1/0: $2) 
         with lp lt 1 pt 7 ps 1.0 lw 3 title 'mean value'
    
    plot "tab.tmp" u 1:(strcol(3) eq "u" ? 1/0: $2):($2*0.95):($2*1.05) \
         with yerrorlines lt 1 pt 7 ps 1.0 lw 3 title 'value+confidence'
    

    The set table part creates a new table with unique values and their average:

    # Curve 0 of 1, 3 points
    # Curve title: ""tmp" u 9:13"
    # x y type
    0.1  32.3633  i
    0.2  64.5033  i
    0.1  32.0938  u
    

    Then you can treat this file as a regular data file, and plot as you wish. The only trick comes here: (strcol(3) eq "u" ? 1/0: $2). This is a simple conditional telling gnuplot to ignore lines where column 3 is "u" (undefined). Then you can directly calculate your confidence interval as you wish.

    Running these lines would give you the following output, with two suggested layouts: enter image description here

    A bit of help from here and here

    Anyway, hope it helps!