
bash / awk / gnuplot: pre-processing data for plotting with gnuplot


I am analyzing multi-column data organized in the following format:

#Acceptor                DonorH           Donor   Frames         Frac      AvgDist       AvgAng
lig_608@O2            GLU_166@H       GLU_166@N      708       0.7548       2.8489     160.3990
lig_608@O3             THR_26@H        THR_26@N      532       0.5672       2.8699     161.9043
THR_26@O            lig_608@H15      lig_608@N6      414       0.4414       2.8509     153.3394
lig_608@N2          HIE_163@HE2     HIE_163@NE2      199       0.2122       2.9167     156.3248
GLN_189@OE1          lig_608@H2      lig_608@N4       32       0.0341       2.8899     156.4308
THR_25@OG1          lig_608@H14      lig_608@N5       26       0.0277       2.8906     160.9933
lig_608@O4            GLY_143@H       GLY_143@N       25       0.0267       2.8647     146.5977
lig_608@O3           THR_25@HG1      THR_25@OG1       16       0.0171       2.7618     152.3421
lig_608@O2         GLN_189@HE21     GLN_189@NE2       15       0.0160       2.8947     154.3567
lig_608@N7         ASN_142@HD22     ASN_142@ND2       10       0.0107       2.9196     147.8856
lig_608@O4         ASN_142@HD21     ASN_142@ND2        9       0.0096       2.8462     148.4038
HIE_41@O            lig_608@H14      lig_608@N5        9       0.0096       2.8693     148.4560
GLN_189@NE2          lig_608@H2      lig_608@N4        7       0.0075       2.9562     153.6447
lig_608@O4         ASN_142@HD22     ASN_142@ND2        4       0.0043       2.8954     158.0293
THR_26@O            lig_608@H14      lig_608@N5        2       0.0021       2.8259     156.4279
lig_608@O4         ASN_119@HD21     ASN_119@ND2        1       0.0011       2.8786     144.1573
lig_608@N2            GLU_166@H       GLU_166@N        1       0.0011       2.9295     149.3281

My gnuplot script, embedded in a Bash script, filters the data and keeps only two columns, subject to two conditions: 1) the label is taken from the 1st or the 3rd column, whichever does not start with "lig"; 2) the value in the 5th column is > 0.05.

#!/bin/bash
output=$(pwd)
# beginning pattern of each processed file
target='HBavg'

# loop over each file and create a bar graph
for file in "${output}"/${target}*.log ; do
 file_name3=$(basename "$file")
 file_name2="${file_name3/.log/}"
 file_name="${file_name2/${target}_/}"
echo "Visualization with gnuplot!"
cat <<EOS | gnuplot > "${output}/${file_name2}.png"
set term pngcairo size 800,600
### conditional xtic labels
reset session
set termoption noenhanced
set title "$file_name" font "Century,22" textcolor "#b8860b"
set tics font "Helvetica,10"

FILE = "$file"

set xlabel "H-bond donor, residue"
set ylabel "Fraction, %"
set yrange [0:1]
set key off
set style fill solid 0.5
set boxwidth 0.9
set grid y
#set xrange[-1:5]

set table \$Filtered
    myTic(col1,col2) = strcol(col1)[1:3] eq 'lig' ? strcol(col2) : strcol(col1)
    plot FILE u ((y0=column(5))>0.05 ? sprintf("%g %s",y0,myTic(1,3)) : '') w table
unset table

plot \$Filtered u 0:1:xtic(2) w boxes, '' u 0:1:1 w labels offset 0,1
### end of script
EOS
done

It writes the filtered data into a new table and produces a multi-bar plot.

In that plot the bars are ordered by their Y values (the 5th column of the input data). How can I instead sort the bars alphabetically by the labels displayed on X, changing the order in which the bars appear on the graph?

Since the original data is always sorted by the 5th column (Frac), would it be possible to re-sort it on the fly while passing it to gnuplot?

One idea is to pipe the data through awk and sort directly in the gnuplot script, e.g.:

plot "<awk -v OFS='\t' 'NR > 1 && \$5 > 0.05' $file | sort -k1,1" using 0:5:xtic(3) with boxes

How could I do the same in my script, where the data is filtered by gnuplot and I only need to sort the bars produced by:

plot \$Filtered u 0:1:xtic(2) w boxes, '' u 0:1:1 w labels offset 0,1

Solution

  • edit: added color alternation

    I would process the data with external tools and then call gnuplot:

    #!/bin/bash
    
    {
    echo '$data << EOD'
    awk 'NR > 1 && $5 > 0.05 {print ($1 ~ /^lig/ ? $2 : $1 ), $5}' file.log |
    sort -t ' ' -k1,1 |
    awk -v colors='0x4472c4 0xed7d31' '
        BEGIN { nc = split(colors,clrArr) }
        { print $0, clrArr[NR % nc + 1] }
    '
    echo 'EOD'
    
    cat << 'EOF'
    set term pngcairo size 800,600
    set title "file.log" font "Century,22" textcolor "#b8860b"
    set xtics noenhanced font "Helvetica,10"
    set xlabel "H-bond donor, residue"
    set ylabel "Fraction, %"
    set yrange [0:1]
    set key off
    set boxwidth 0.9
    set style fill solid 1.0
    plot $data using 0:2:3:xtic(1) with boxes lc rgb var, \
            '' using 0:2:2 with labels offset 0,1
    EOF
    } | gnuplot > file.png
    


    remarks:

    • The problem with printing the values on top of the bars in gnuplot is that you can't do it directly from a stream; you need a file or a variable. Here I saved the input data into the $data variable (a gnuplot datablock).
    • You can expand shell variables in the heredoc if you unquote its delimiter (<< 'EOF' => << EOF), but then you have to escape the $ of $data (write \$data) so that the shell leaves it alone.
    • The simplest way to add colors would be to add a "color" field in the output of the first awk, but the sort would then scramble the alternation; that's why I add the color in another awk after the sort.
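
The second and third remarks can be combined into a per-file version that fits the loop from the question. The following is only a sketch under stated assumptions: `emit_plot_script` is a hypothetical helper name, the label is taken from column 3 as the question describes (the one-liner above uses column 2), and `LC_ALL=C` is added so that the sort order does not depend on the locale.

```shell
#!/bin/bash
# Sketch only: emit_plot_script is a hypothetical helper, not part of the answer.
# It prints a complete gnuplot script (datablock + commands) for one log file.
emit_plot_script() {
    local file=$1 title=$2

    echo '$data << EOD'
    # Keep rows with Frac > 0.05 and label each by the non-ligand partner.
    # Assumption: column 3 (Donor) is used, as the question states;
    # the answer's one-liner takes column 2 (DonorH) instead.
    awk 'NR > 1 && $5 > 0.05 {print ($1 ~ /^lig/ ? $3 : $1), $5}' "$file" |
    LC_ALL=C sort -k1,1 |
    awk -v colors='0x4472c4 0xed7d31' '
        BEGIN { nc = split(colors, clrArr) }
        { print $0, clrArr[NR % nc + 1] }
    '
    echo 'EOD'

    # Unquoted heredoc: $title is expanded by the shell,
    # while \$data reaches gnuplot as the literal name $data.
    cat << EOF
set term pngcairo size 800,600
set termoption noenhanced
set title "$title" font "Century,22" textcolor "#b8860b"
set xtics font "Helvetica,10"
set xlabel "H-bond donor, residue"
set ylabel "Fraction, %"
set yrange [0:1]
set key off
set boxwidth 0.9
set style fill solid 1.0
plot \$data using 0:2:3:xtic(1) with boxes lc rgb var, '' using 0:2:2 with labels offset 0,1
EOF
}

# Usage (assumes gnuplot is installed and files named HBavg_*.log exist):
#   for file in ./HBavg*.log; do
#       name=\$(basename "\$file" .log)   # written here without the backslashes
#       emit_plot_script "$file" "${name#HBavg_}" | gnuplot > "$name.png"
#   done
```

Because the function only prints text, its output can be inspected (or redirected to a file) before it is piped to gnuplot, which makes it easy to check that the rows are in the expected order.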