
awk: searching a log for multiple patterns


I am analysing data spread across many separate log files. Each log has the following format:

Finding intramodel H-bonds
Constraints relaxed by 0.6 angstroms and 20 degrees
Models used:
    1.1 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.2 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.3 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.4 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.5 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.6 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.7 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.8 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.9 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.10 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.11 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.12 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.13 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.14 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.15 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.16 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.17 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.18 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.19 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.20 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.21 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.22 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.23 SarsCov2_mol30_nsp5holoHIE_rep1.pdb
    1.24 SarsCov2_mol30_nsp5holoHIE_rep1.pdb

17 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.1/? HIE 163 NE2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.1/A LIG 888 O2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.1/? HIE 163 HE2    3.250  2.448
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.1/? GLU 166 N     SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.1/A LIG 888 O1   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.1/? GLU 166 H      2.817  2.027
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? THR 26 N      SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/A LIG 888 N2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? THR 26 H       3.453  2.470
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? HIE 163 NE2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/A LIG 888 O2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? HIE 163 HE2    3.269  2.495
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? GLU 166 N     SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/A LIG 888 O1   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? GLU 166 H      3.555  2.634
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.4/? GLU 166 N     SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.4/A LIG 888 O1   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.4/? GLU 166 H      3.622  2.743
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.5/? GLU 166 N     SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.5/A LIG 888 O1   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.5/? GLU 166 H      2.797  1.790
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.10/? GLU 166 N    SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.10/A LIG 888 O1  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.10/? GLU 166 H     3.780  2.783
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.12/? GLU 166 N    SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.12/A LIG 888 O1  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.12/? GLU 166 H     3.273  2.541
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.14/? HIE 163 NE2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.14/A LIG 888 O2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.14/? HIE 163 HE2   3.389  2.556
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.15/? ASN 142 ND2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.15/A LIG 888 O2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.15/? ASN 142 2HD2  3.067  2.303
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.15/? GLY 143 N    SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.15/A LIG 888 N2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.15/? GLY 143 H     2.962  2.016
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.16/? GLU 166 N    SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.16/A LIG 888 O1  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.16/? GLU 166 H     2.926  1.930
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.19/? GLN 189 NE2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.19/A LIG 888 O1  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.19/? GLN 189 1HE2  3.026  2.212
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.22/? GLY 143 N    SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.22/A LIG 888 O1  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.22/? GLY 143 H     2.855  1.848
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.22/? HIE 163 NE2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.22/A LIG 888 O2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.22/? HIE 163 HE2   3.345  2.400
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.24/? GLN 189 NE2  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.24/A LIG 888 O1  SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.24/? GLN 189 1HE2  2.893  2.286

I need to consider every line that follows the header string

H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):

Among those remaining lines I need to check whether the three keywords:

GLU 166
HIE 163
THR 26

are all present under the same index (1.1, 1.2 ... 1.24) and, if so, print the name of the log plus the index value (taken from the second column). In the log above the index value is 1.2, since all three keywords appear there:

SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? THR 26 N      SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/A LIG 888 N2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? THR 26 H       3.453  2.470
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? HIE 163 NE2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/A LIG 888 O2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? HIE 163 HE2    3.269  2.495
SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? GLU 166 N     SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/A LIG 888 O1   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? GLU 166 H      3.555  2.634

so the expected output should be:

log_name.log the patterns are found in the #1.2!

UPDATE: In some tricky cases one of the search patterns may sit in a different part of the line (the format is always preserved). In the example below, the pattern GLU 166 on the last line is in a different column than the other two patterns that belong to index #1.3:

43 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.1/? THR 26 N      SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.1/A LIG 888 O    SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.1/? THR 26 H       3.355  2.554
SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.1/? GLU 166 N     SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.1/A LIG 888 O    SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.1/? GLU 166 H      3.071  2.100
SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.1/A LIG 888 N     SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.1/? THR 26 O     SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.1/A LIG 888 H      3.463  2.657
SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.2/? HIE 163 NE2   SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.2/A LIG 888 O    SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.2/? HIE 163 HE2    3.019  2.147
SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.2/A LIG 888 N     SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.2/? PHE 140 O    SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.2/A LIG 888 H      3.169  2.591
SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.3/? THR 26 N      SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.3/A LIG 888 S    SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.3/? THR 26 H       3.666  2.696
SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.3/? HIE 163 NE2   SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.3/A LIG 888 N    SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.3/? HIE 163 HE2    2.959  2.050
SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.3/A LIG 888 N     SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.3/? GLU 166 O    SarsCov2_06I_nsp5holoHIE_rep2.pdb #1.3/A LIG 888 H      3.118  2.200

I've tried looping over the logs in a simple bash workflow with awk code that looks for one pattern, but I could not extend it to three patterns that belong to the same index #:

for log in /logs/*hbondsALL_rep"${i}".log; do
  log_name=$(basename "$log" .log | cut -d'_' -f 2)
  # search only one pattern GLU 166
  i=$(awk -v n=1 '/GLU 166/ {gsub(/.*\.|\/\?/,"",$2); n=$2; exit} END {print n}' "$log")
  # insert here alternative search solution which check the patterns!
  # and find the index {i} in the log
  # log_name.log the patterns are found in the # {i} 
done

May I use sed or awk for such a pattern-based search integrated into bash?


Solution

  • Assumptions:

    • lines of interest have three separate index/keyword tuples we need to compare (fields #2 / #3 / #4, fields #7 / #8 / #9, fields #12 / #13 / #14)
    • for a given line of interest the numeric portion of the three indexes (ie, fields #2, #7 and #12) is always the same (eg, #1.2/? is equivalent to #1.2/A)
    • within a file an index/keyword pair (eg, #1.2 / GLU 166) may occur more than once
    • within a file all lines of interest are sorted by index (eg, #1.1 before #1.2 before #1.3 ...)
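    To make the field layout in these assumptions concrete, here is a minimal sketch that prints the index and the three keyword pairs for one line copied from the question's log. The helper name show_fields is hypothetical and only serves this illustration.

```shell
# One sample line copied from the question's log (whitespace-separated fields).
line='SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? THR 26 N      SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/A LIG 888 N2   SarsCov2_mol30_nsp5holoHIE_rep1.pdb #1.2/? THR 26 H       3.453  2.470'

# show_fields is a hypothetical helper name, not part of the answer's code.
show_fields() {
    printf '%s\n' "$1" | awk '{
        split($2, a, "/")                  # "#1.2/?" is split on "/", so a[1] holds "#1.2"
        print "index:", a[1]
        for (i = 3; i <= 13; i += 5)       # keyword pairs live in fields 3/4, 8/9 and 13/14
            print "keyword:", $i FS $(i+1)
    }'
}

show_fields "$line"
```

    For this line the sketch prints the index #1.2 followed by the pairs THR 26, LIG 888 and THR 26, which is why the search has to look at all three positions rather than only at fields #3 / #4.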

    One awk idea that allows the user to supply a list of keywords via a bash variable:

    keywords='GLU 166,HIE 163,THR 26'
    
    awk -v keywords="${keywords}" '
    
    function print_match() {
    
        if (length(found) == key_cnt) {                    # if all keys were found then ...
           print fname,"the patterns are found in the",ndx "!"
        #  found_hb=0                                      # uncomment to  print only the first matching index in a file
        }
        delete found                                       # clear found[] array
    }
    
    BEGIN      { key_cnt=split(keywords,a,",")             # parse input parameter "keywords"
                 for (i=1;i<=key_cnt;i++)                  # convert to an associative array where keys are the array indices
                     keys[a[i]]
                 delete found                              # declare to awk that found[] is an array
               }
    
    FNR==1     { print_match()                             # new file? flush previous index details and ... 
                 fname=FILENAME                            # remember this file's name; FILENAME already points at the new file during the flush above
                 found_hb=0                                # disable testing for keywords
               }
    /^H-bonds/ { found_hb=1; next }                        # enable testing for keywords
    found_hb   { split($2,a,"/")                           # obtain numeric portion of index and ...
                 new_ndx=a[1]                              # store in variable new_ndx
                 if (new_ndx != ndx) {                     # if this is a new index then ...
                    print_match()                          # flush previous index details and ...
                    ndx=new_ndx                            # make note of new index
                 }
                 for (i=3;i<=13;i=i+5) {                   # loop through keywords: fields #3/#4, #8/#9, #13/#14
                     key=$i FS $(i+1)
                     if (key in keys)                      # if we find a match then ...
                        found[key]                         # create an entry in the found[] array
                 }
               }
    END        { print_match() }                           # flush last index details
    ' test.log
    

    NOTE: for a large number of keywords (to search for) I'd probably opt for storing them in a file which in turn would require a few tweaks of this code to load said file (of keywords) into the keys[] associative array, but that's for another day and a different Q&A session ...
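    As a rough sketch of the file-based variant mentioned in the note above, the keywords can be read from the first file on the command line with the common NR==FNR idiom. Everything below is an assumption made for illustration: the scratch file names keywords.txt and demo.log, the shortened model name m.pdb, the abbreviated demo lines (written in the question's format), and the names report, in_hb and n_found. A counter replaces length(found) so the sketch does not depend on array length support.

```shell
# Hypothetical scratch files, created only so the sketch runs on its own.
tmp=$(mktemp -d)
printf '%s\n' 'GLU 166' 'HIE 163' 'THR 26' > "$tmp/keywords.txt"   # one keyword per line
cat > "$tmp/demo.log" <<'EOF'
4 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
m.pdb #1.1/? GLU 166 N  m.pdb #1.1/A LIG 888 O1  m.pdb #1.1/? GLU 166 H  2.817  2.027
m.pdb #1.2/? THR 26 N  m.pdb #1.2/A LIG 888 N2  m.pdb #1.2/? THR 26 H  3.453  2.470
m.pdb #1.2/? HIE 163 NE2  m.pdb #1.2/A LIG 888 O2  m.pdb #1.2/? HIE 163 HE2  3.269  2.495
m.pdb #1.2/A LIG 888 N  m.pdb #1.2/? GLU 166 O  m.pdb #1.2/A LIG 888 H  3.118  2.200
EOF

result=$(awk '
function report() {                         # print the index if every keyword was seen
    if (key_cnt > 0 && n_found == key_cnt)
        print fname, "the patterns are found in the", ndx "!"
    split("", found); n_found = 0           # portable way to empty found[]
}
NR == FNR  { if ($0 != "" && !($0 in keys)) { keys[$0]; key_cnt++ }   # first file: load keywords
             next }
FNR == 1   { report(); fname = FILENAME; in_hb = 0; ndx = "" }        # start of each log file
/^H-bonds/ { in_hb = 1; next }                                        # header found: start testing
in_hb      { split($2, a, "/")                                        # "#1.2/?" gives a[1] = "#1.2"
             if (a[1] != ndx) { report(); ndx = a[1] }                # new index: flush the previous one
             for (i = 3; i <= 13; i += 5) {                           # fields 3/4, 8/9, 13/14
                 key = $i FS $(i+1)
                 if ((key in keys) && !(key in found)) { found[key]; n_found++ }
             }
           }
END        { report() }                                               # flush the last index
' "$tmp/keywords.txt" "$tmp/demo.log")

printf '%s\n' "$result"
rm -rf "$tmp"
```

    With the demo data this prints a single line ending in "demo.log the patterns are found in the #1.2!". Note that the NR==FNR idiom assumes the keyword file is not empty; an empty keyword file would make awk treat the first log as the keyword list.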


    Taking it for a test drive ...

    NOTE: test.log is a copy of OP's original input file while test.log2 is a copy of OP's 2nd/UPDATE input file

    When keywords='GLU 166,HIE 163,THR 26':

    test.log the patterns are found in the #1.2!
    test.log2 the patterns are found in the #1.3!
    

    When keywords='GLU 166,HIE 163':

    test.log the patterns are found in the #1.1!
    test.log the patterns are found in the #1.2!           # does not print if the 'found_hb=0' line (in the print_match() function) is uncommented
    test.log2 the patterns are found in the #1.3!
    

    When keywords='ASN 142,GLY 143':

    test.log the patterns are found in the #1.15!
    

    When keywords='ASN 142,HIE 163':

               <<<=== no output