
sed: removing duplicated patterns in a log file


I am post-processing a log file arranged in the following format:

Finding intramodel H-bonds
Constraints relaxed by 0.55 angstroms and 20 degrees
Models used:
    1.1 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.6 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.10 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.8 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.2 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.3 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.4 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.7 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.5 SarsCov2_structure31R_nsp5holo_rep1.pdb
    1.9 SarsCov2_structure31R_nsp5holo_rep1.pdb

6 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
SarsCov2_structure31R_nsp5holo_rep1.pdb #1.3/? ASN 142 ND2   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.3/A UNL 1 N   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.3/? ASN 142 2HD2   3.419  2.541
SarsCov2_structure31R_nsp5holo_rep1.pdb #1.5/? GLN 189 NE2   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.5/A UNL 1 O   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.5/? GLN 189 1HE2   2.883  2.159
SarsCov2_structure31R_nsp5holo_rep1.pdb #1.6/? HIS 163 NE2   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.6/A UNL 1 O   no hydrogen  

From this log I need to keep everything from the 3rd line onward, and then delete every occurrence of the duplicated pattern "SarsCov2_structure31R_nsp5holo_rep1.pdb". Can I use a regex with sed to detect any name matching such a pattern in the log (anything ending in .pdb), so that it is removed automatically for each processed log? The expected output should be:

Models used:
    1.1 
    1.6 
    1.10 
    1.8 
    1.2 
    1.3 
    1.4 
    1.7 
    1.5 
    1.9 

6 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
 #1.3/? ASN 142 ND2    #1.3/A UNL 1 N    #1.3/? ASN 142 2HD2   3.419  2.541
 #1.5/? GLN 189 NE2    #1.5/A UNL 1 O    #1.5/? GLN 189 1HE2   2.883  2.159
 #1.6/? HIS 163 NE2    #1.6/A UNL 1 O   no hydrogen            3.299  N/A
 #1.7/? GLN 189 NE2    #1.7/A UNL 1 O    #1.7/? GLN 189 1HE2   3.109  2.147
 #1.9/? ASN 142 ND2    #1.9/A UNL 1 O    #1.9/? ASN 142 1HD2   3.032  2.319
 #1.10/? GLN 189 NE2   #1.10/A UNL 1 O   #1.10/? GLN 189 1HE2  3.054  2.125

Here is my attempt without a regex, which does not work yet :-)

cat test.log | tail -n +2 | sed -e "/SarsCov2_structure31R_nsp5holo_rep1.pdb/d" >> ./test2.log

Solution

  • Using sed

    $ sed 's/[[:alnum:]_]*\.pdb//g;1,2d' input_file
    Models used:
        1.1
        1.6
        1.10
        1.8
        1.2
        1.3
        1.4
        1.7
        1.5
        1.9
    
    6 H-bonds
    H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
     #1.3/? ASN 142 ND2    #1.3/A UNL 1 N    #1.3/? ASN 142 2HD2   3.419  2.541
     #1.5/? GLN 189 NE2    #1.5/A UNL 1 O    #1.5/? GLN 189 1HE2   2.883  2.159
     #1.6/? HIS 163 NE2    #1.6/A UNL 1 O   no hydrogen
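
For context on why the attempt in the question removes too much: in sed, `/pattern/d` deletes every whole line that contains the pattern, whereas `s/pattern//g` removes only the matched text and keeps the rest of the line. Also, `tail -n +2` starts output at line 2, so it skips only one header line, while `1,2d` drops the first two. The sketch below wraps the command from this answer in a shell function and runs it on a shortened copy of the log. The function name `strip_pdb_names`, the temporary sample file, and the extra `s/[[:space:]]*$//` step (which trims the trailing blanks left behind after the file name is removed) are illustrative additions, not part of the original answer.

```shell
# Hypothetical helper: wraps the sed command from the answer above.
strip_pdb_names() {
    # 1,2d                    drop the first two header lines
    # s/[[:alnum:]_]*\.pdb//g remove every run of letters, digits, or
    #                         underscores that ends in ".pdb"
    # s/[[:space:]]*$//       optional extra: trim trailing whitespace
    sed '1,2d; s/[[:alnum:]_]*\.pdb//g; s/[[:space:]]*$//' "$1"
}

# Shortened sample log taken from the question, written to a temp file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Finding intramodel H-bonds
Constraints relaxed by 0.55 angstroms and 20 degrees
Models used:
    1.1 SarsCov2_structure31R_nsp5holo_rep1.pdb
SarsCov2_structure31R_nsp5holo_rep1.pdb #1.3/? ASN 142 ND2   SarsCov2_structure31R_nsp5holo_rep1.pdb #1.3/A UNL 1 N
EOF

cleaned=$(strip_pdb_names "$sample")
printf '%s\n' "$cleaned"
rm -f "$sample"
```

To process a real log, redirecting the function's output with `>` rather than `>>` overwrites the result file on each run instead of appending to it.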