
How to add one space character (without changing any other characters) to "one character strings" using awk, sed, or grep?


I obtained this text file using sed and awk (leap.log):

Template_frcmod
MASS

Pd 0.000         0.000 

BOND
Pd-c
Pd-3e
c-Pd
4p-ca
o-3e
n-3e
Pd-4e
3p-ca
o-4e
n-4e

ANGLE
Pd-c-Pd
Pd-3e-o
Pd-3e-n
Pd-1c-Pd
c-Pd-4p
c-Pd-3e
c-Pd-1c
c-Pd-3p
c-Pd-4e
4p-ca-ca
4p-Pd-3e
4p-Pd-1c
o-3e-n
3e-n-c3
3e-Pd-1c
ca-4p-ca
Pd-4e-o
Pd-4e-n
1c-Pd-4e
3p-ca-ca
3p-Pd-4e
o-4e-n
4e-n-c3
ca-3p-ca

DIHE

 Pd-4p-ca-ca
 Pd-3e-n-c3
 c-Pd-3e-o
 c-Pd-3e-n
 c-Pd-4e-o
 c-Pd-4e-n
 4p-Pd-3e-o
 4p-Pd-3e-n
 o-3e-n-c3
 o-3e-Pd-1c
 n-3e-Pd-1c
 ca-4p-ca-ca
 ca-ca-4p-ca
 Pd-3p-ca-ca
 Pd-4e-n-c3
 1c-Pd-4e-o
 1c-Pd-4e-n
 3p-Pd-4e-o
 3p-Pd-4e-n
 o-4e-n-c3
 ca-3p-ca-ca
 ca-ca-3p-ca

IMPROPER

NONBON

Now I have a problem with "one character" atom names:

c-Pd-4p

In this line, and in every other line that contains a one-character atom name, "c" must become two characters: "c " (with a trailing space):

c -Pd-4p

Likewise, in 4e-n-c3 the "n" must become "n " (4e-n -c3), and "Pd-c" must become "Pd-c ", etc. Every one-character atom name must be padded to two characters with a trailing space.

When I try to change "c" to "c ", "1c" also becomes "1c " (Pd-1c-Pd --> Pd-1c -Pd), but I don't want to change two-character atom names; they must stay the same.

When I try this command:

awk 'BEGIN{FS="-"}{ if(length($2) == 1 ) $2= $2" " } {print $0}' leap.log

This time the "-" signs disappear. What should I do to append a space to every one-character atom name?

Expected results (the comments are just for this question; the real file will have no comments):

Template_frcmod
MASS

Pd 0.000         0.000 

BOND
Pd-c  #Also the last "c" must be "c " 
Pd-3e
c -Pd
4p-ca
o -3e
n -3e
Pd-4e
3p-ca
o -4e
n -4e

ANGLE
Pd-c -Pd
Pd-3e-o 
Pd-3e-n 
Pd-1c-Pd
c -Pd-4p
c -Pd-3e
c -Pd-1c
c -Pd-3p
c -Pd-4e
4p-ca-ca
4p-Pd-3e
4p-Pd-1c
o -3e-n 
3e-n -c3
3e-Pd-1c
ca-4p-ca
Pd-4e-o 
Pd-4e-n 
1c-Pd-4e
3p-ca-ca
3p-Pd-4e
o -4e-n
4e-n -c3
ca-3p-ca

DIHE

Pd-4p-ca-ca
Pd-3e-n-c3
c -Pd-3e-o #Also the last "o" must be "o "
c -Pd-3e-n #Also the last "n" must be "n " 
c -Pd-4e-o #Also the last "o" must be "o "
c -Pd-4e-n #Also the last "n" must be "n "
4p-Pd-3e-o #Also the last "o" must be "o " 
4p-Pd-3e-n #Also the last "n" must be "n " 
o -3e-n-c3
o -3e-Pd-1c
n-3e-Pd-1c
ca-4p-ca-ca
ca-ca-4p-ca
Pd-3p-ca-ca
Pd-4e-n-c3
1c-Pd-4e-o
1c-Pd-4e-n
3p-Pd-4e-o
3p-Pd-4e-n
o -4e-n -c3
ca-3p-ca-ca
ca-ca-3p-ca

IMPROPER

NONBON

Solution

  • Assumptions:

    • the lines of interest are exactly the lines that contain a -
    • each line of interest has exactly one white-space-delimited field, and that field holds the - delimited names
    • every - delimited string must be tested, and each one with length()==1 gets a space ( ) appended to the end
    • leading white space in a line can be ignored/removed

    One awk idea (strips leading white space):

    awk '
    /-/ { n=split($1,arr,"-")                          # split field #1 into arr[] array based on "-" delimiter
          x=delim=""
          for (i=1;i<=n;i++) {                         # loop through array
              # piece together our new field
              x=x delim arr[i] ( length(arr[i]) == 1 ? " " : "")
              delim="-"
          }
          $1=x                                         # replace field #1 with value in variable "x"
        }
    1
    ' leap.log
    

    Another awk idea (maintains leading white space):

    awk '
    BEGIN { FS=OFS="-" }                   # define input/output field delimiter == "-"
    NF>1  { for (i=1;i<=NF;i++) {          # if more than one "-" delimited field then ...
                old=$i
                gsub(/ /,"",old)           # strip any (leading) spaces from field
                if (length(old) == 1)      # if length() == 1 then ...
                   $i=$i " "               # append space to current field
            }
          }
    1
    ' leap.log
    

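    Both generate the following (single-character names at the end of a line also gain a trailing space, which is not visible here):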

    Template_frcmod
    MASS
    
    Pd 0.000         0.000
    
    BOND
    Pd-c
    Pd-3e
    c -Pd
    4p-ca
    o -3e
    n -3e
    Pd-4e
    3p-ca
    o -4e
    n -4e
    
    ANGLE
    Pd-c -Pd
    Pd-3e-o
    Pd-3e-n
    Pd-1c-Pd
    c -Pd-4p
    c -Pd-3e
    c -Pd-1c
    c -Pd-3p
    c -Pd-4e
    4p-ca-ca
    4p-Pd-3e
    4p-Pd-1c
    o -3e-n
    3e-n -c3
    3e-Pd-1c
    ca-4p-ca
    Pd-4e-o
    Pd-4e-n
    1c-Pd-4e
    3p-ca-ca
    3p-Pd-4e
    o -4e-n
    4e-n -c3
    ca-3p-ca
    
    DIHE
    
     Pd-4p-ca-ca
     Pd-3e-n -c3
     c -Pd-3e-o
     c -Pd-3e-n
     c -Pd-4e-o
     c -Pd-4e-n
     4p-Pd-3e-o
     4p-Pd-3e-n
     o -3e-n -c3
     o -3e-Pd-1c
     n -3e-Pd-1c
     ca-4p-ca-ca
     ca-ca-4p-ca
     Pd-3p-ca-ca
     Pd-4e-n -c3
     1c-Pd-4e-o
     1c-Pd-4e-n
     3p-Pd-4e-o
     3p-Pd-4e-n
     o -4e-n -c3
     ca-3p-ca-ca
     ca-ca-3p-ca
    
    
    IMPROPER
    
    NONBON
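
Because the appended spaces at the end of a line are invisible in the listing above, a quick check can help. The following is only a sketch: it wraps the second awk script in a shell function (the name pad_names is hypothetical, not part of the original answer) and uses sed to mark each line end with a | so that trailing spaces show up.

```shell
# pad_names: hypothetical wrapper around the second awk script from this answer
pad_names() {
    awk '
    BEGIN { FS = OFS = "-" }              # split and rejoin on "-"
    NF > 1 {
        for (i = 1; i <= NF; i++) {
            old = $i
            gsub(/ /, "", old)            # ignore spaces when measuring the name
            if (length(old) == 1)         # one-character atom name
                $i = $i " "               # pad it with a trailing space
        }
    }
    1'
}

# Mark the end of every line with "|" to make trailing spaces visible
printf 'Pd-c\n c-Pd-3e-o\nPd-1c-Pd\n' | pad_names | sed 's/$/|/'
# prints:
# Pd-c |
#  c -Pd-3e-o |
# Pd-1c-Pd|
```

On the real file the same check would be `pad_names < leap.log | sed 's/$/|/'`.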
    

    NOTE: with the 1st awk script the entries under DIHE lose their leading white space, because assigning to $1 makes awk rebuild the line
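
As for why the "-" signs disappeared in the attempt from the question: that command sets FS but leaves OFS at its default (a single space), and awk rebuilds the line using OFS whenever a field is assigned. A minimal sketch of the difference follows; the function names attempt_fs_only and attempt_fs_ofs are hypothetical and exist only for this illustration.

```shell
# Same logic as the command in the question: FS is "-", OFS stays a space
attempt_fs_only() {
    awk 'BEGIN { FS = "-" } { if (length($2) == 1) $2 = $2 " " } 1'
}

# Identical except that OFS is also set to "-"
attempt_fs_ofs() {
    awk 'BEGIN { FS = OFS = "-" } { if (length($2) == 1) $2 = $2 " " } 1'
}

printf '4e-n-c3\n' | attempt_fs_only    # prints: 4e n  c3
printf '4e-n-c3\n' | attempt_fs_ofs     # prints: 4e-n -c3
```

Lines whose second field is not exactly one character are never reassigned, so awk prints them untouched. That is why only some lines lost their dashes. Note that this sketch still tests only the second field; the scripts in the answer loop over all fields.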