I have a data file with tab-separated columns which looks like this:
6 27 4 12 20 100 50000 false 0.1 "DFSA" 2 201414 31.47408
3 27 4 12 20 100 50000 false 0.1 "DFSA" 2 204236 31.91436
4 27 4 12 20 100 50000 false 0.1 "DFSA" 2 206964 32.09382
8 27 4 12 20 100 50000 false 0.1 "DFSA" 2 203379 31.88138
1 27 4 12 20 100 50000 false 0.1 "DFSA" 2 207287 32.4096
2 27 4 12 20 100 50000 false 0.1 "DFSA" 2 207115 32.61346
7 27 4 12 20 100 50000 false 0.1 "DFSA" 2 207751 32.77364
5 27 4 12 20 100 50000 false 0.1 "DFSA" 2 209856 32.77856
13 27 4 12 20 100 50000 false 0.1 "DFSA" 2 204257 32.08478
10 27 4 12 20 100 50000 false 0.1 "DFSA" 2 207335 32.22742
9 27 4 12 20 100 50000 false 0.1 "DFSA" 2 200518 31.64624
11 27 4 12 20 100 50000 false 0.1 "DFSA" 2 202415 32.0591
16 27 4 12 20 100 50000 false 0.1 "DFSA" 2 201548 31.59604
12 27 4 12 20 100 50000 false 0.1 "DFSA" 2 213840 33.27478
15 27 4 12 20 100 50000 false 0.1 "DFSA" 2 208926 32.37782
14 27 4 12 20 100 50000 false 0.1 "DFSA" 2 210572 32.81294
17 27 4 12 20 100 50000 false 0.1 "DFSA" 2 204724 32.26324
20 27 4 12 20 100 50000 false 0.1 "DFSA" 2 199169 31.53494
18 27 4 12 20 100 50000 false 0.1 "DFSA" 2 208179 32.73408
19 27 4 12 20 100 50000 false 0.1 "DFSA" 2 204342 31.82608
21 27 4 12 20 100 50000 false 0.1 "DFSA" 2 205068 31.986
24 27 4 12 20 100 50000 false 0.1 "DFSA" 2 207798 32.49448
22 27 4 12 20 100 50000 false 0.1 "DFSA" 2 207499 32.60746
23 27 4 12 20 100 50000 false 0.1 "DFSA" 2 214065 33.17844
25 27 4 12 20 100 50000 false 0.1 "DFSA" 2 198386 31.4488
26 27 4 12 20 100 50000 false 0.1 "DFSA" 2 208307 32.77412
28 27 4 12 20 100 50000 false 0.1 "DFSA" 2 208669 32.77474
27 27 4 12 20 100 50000 false 0.1 "DFSA" 2 214324 33.16506
30 27 4 12 20 100 50000 false 0.1 "DFSA" 2 209924 32.77186
29 27 4 12 20 100 50000 false 0.1 "DFSA" 2 212959 33.3218
32 27 4 12 20 100 50000 false 0.2 "DFSA" 2 486241 64.3479
31 27 4 12 20 100 50000 false 0.2 "DFSA" 2 487487 64.65076
33 27 4 12 20 100 50000 false 0.2 "DFSA" 2 494703 65.06718
34 27 4 12 20 100 50000 false 0.2 "DFSA" 2 488164 64.77828
36 27 4 12 20 100 50000 false 0.2 "DFSA" 2 476513 63.3158
35 27 4 12 20 100 50000 false 0.2 "DFSA" 2 491005 65.02426
38 27 4 12 20 100 50000 false 0.2 "DFSA" 2 487454 64.44962
37 27 4 12 20 100 50000 false 0.2 "DFSA" 2 490494 65.06572
39 27 4 12 20 100 50000 false 0.2 "DFSA" 2 472081 63.31234
40 27 4 12 20 100 50000 false 0.2 "DFSA" 2 498294 65.02114
The data continues down growing the value in column 9 (30 rows with value 0.1, 30 rows with value 0.2, another 30 rows with 0.3 and so on).
I would like to plot with yerrorlines
in Gnuplot the mean value column 13 for each of the 30 rows with the same value of column 9. To make the plot, I should pass Gnuplot column 9 value
, mean of column 13
of the 30 rows, and the min value and the max value of the confidence interval (e.g. 95%). In the case of 0.1, the row to pass to Gnuplot should be like the following:
0.1 36.08 35.83 36.33
0.2 ..... ..... .....
0.3 ..... ..... .....
... ..... ..... .....
Assuming a 95% confidence interval is +0.25 and +0.25 around the mean of 36.08
Thank you for your help Regards
I know the question has the tag awk
, but since gnuplot
was mentioned, may I suggest a gnuplot-only solution?
Gnuplot has a smooth unique
function that would look for unique values (in your case column 9) and do a simple average of the y-values (in your case column 13). You can plot this directly, but with a few extra lines you can make your output much better.
Using the input file from the question, and the following code:
set term png
set out "test.png"
set table "tab.tmp"
plot "data.txt" u 9:13 smooth unique
unset table
set xrange[0.09:0.21]
set yrange[25:75]
set style fill transparent solid 0.2 noborder
set key top left
set multiplot layout 2,1
plot "tab.tmp" u 1:(strcol(3) eq "u" ? 1/0: $2*0.95):($2*1.05)
with filledcurves title '95% confidence', \
"tab.tmp" u 1:(strcol(3) eq "u" ? 1/0: $2)
with lp lt 1 pt 7 ps 1.0 lw 3 title 'mean value'
plot "tab.tmp" u 1:(strcol(3) eq "u" ? 1/0: $2):($2*0.95):($2*1.05) \
with yerrorlines lt 1 pt 7 ps 1.0 lw 3 title 'value+confidence'
The set table
part creates a new table with unique values and their average:
# Curve 0 of 1, 3 points
# Curve title: ""tmp" u 9:13"
# x y type
0.1 32.3633 i
0.2 64.5033 i
0.1 32.0938 u
Then you can treat this file as a regular data file, and plot as you wish. The only trick comes here: (strcol(3) eq "u" ? 1/0: $2)
. This is a simple conditional telling gnuplot to ignore lines where column 3 is "u" (undefined). Then you can directly calculate your confidence interval as you wish.
Running these lines would give you the following output, with two suggested layouts:
A bit of help from here and here
Anyway, hope it helps!