awk, sed, grep, text-processing

Using SED to replace specific patterns found within parentheses?


I'm having trouble with this. I'm trying to use a Bash script (sed, in particular) to process the following text. Other methods are welcome, of course, but I'm hoping for a Bash solution.

Tricky input:

("a"|"b"|"c")."A"|"B"|"C".("e"|"f")."E"|"F"

Desired output:

("a"|"b"|"c")."ABC".("e"|"f")."EF"

Mainly, I think what I want to do is remove every "|" string (double quote, pipe, double quote), but only where it appears outside of any text in parentheses.

The problem gets harder with the different forms of input in my dataset: the combination of blocks (delimited by .) with and without parentheses varies from line to line.

Thanks in advance.


Something I've tried with sed:

gsed -E "s/(\.\"[[:graph:]]+)\"\|\"/\1/g" input.txt

The output I get is:

("a"|"b"|"c")."A"|"B"|"C".("e"|"f")."EF"

It looks like I'm only getting part of the desired output: the substitution is applied to the last block only.


Solution

  • Assumptions/understandings:

    • fields are separated by periods
    • fields wrapped in parens are to be left alone
    • all other fields keep one leading and one trailing double quote; every other double quote, and every pipe, in those fields is removed
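To make these assumptions concrete, here is a minimal sketch that splits the first sample line on periods and labels each field as one to keep or one to rewrite. The function name `classify_fields` and the tab-separated report format are illustrative choices, not part of the original answer.

```shell
#!/usr/bin/env bash
# Sketch only: split a line into period-separated fields and report
# which ones are wrapped in parens (keep) and which are not (rewrite).
# classify_fields is a hypothetical helper name used for illustration.
classify_fields() {
    awk -F'.' '{
        for (i = 1; i <= NF; i++) {
            # same paren test the answer uses, written as a regex literal
            action = ($i ~ /^[(].*[)]$/) ? "keep" : "rewrite"
            printf "%d\t%s\t%s\n", i, action, $i
        }
    }'
}

printf '%s\n' '("a"|"b"|"c")."A"|"B"|"C".("e"|"f")."E"|"F"' | classify_fields
```

For this line, fields 1 and 3 are reported as "keep" and fields 2 and 4 as "rewrite". If the real data can contain a period inside a quoted value, the field split no longer lines up with the blocks and the assumptions would need revisiting.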

    Sample data:

    $ cat pipes.dat
    ("a"|"b"|"c")."A"|"B"|"C".("e"|"f")."E"|"F"
    "j"|"K"|"L"."m"|"n"|"o"|"p".("x"|"y"|"z")
    

    One awk idea (gensub() is a GNU awk extension, so this needs gawk):

    awk '
    BEGIN { FS=OFS="." }                                      # define input/output field separator as a period
    
          { printf "############\nbefore: %s\n",$0            # print a record separator and the current input line;
                                                              # solely for display purposes; this line can
                                                              # be removed/commented-out once logic is verified
    
            for (i=1; i<=NF; i++)                             # loop through fields
                if ( $i !~ "^[(].*[)]$" )                     # if field does not start/end with parens then ...
                    $i="\"" gensub(/"|\|/,"","g",$i) "\""     # replace field with a new double quote (+) modified string
                                                              # whereby all double quotes and pipes are removed (+)
                                                              # a new ending double quote
    
            printf "after : %s\n",$0                          # print the newly modified line;
                                                              # can be replaced with "print" once logic is verified
          }
    ' pipes.dat                                               # read data from file; to read from a variable remove this line and ...
    #' <<< "${variable_name}"                                 # uncomment this line
    

    The above generates:

    ############
    before: ("a"|"b"|"c")."A"|"B"|"C".("e"|"f")."E"|"F"
    after : ("a"|"b"|"c")."ABC".("e"|"f")."EF"
    ############
    before: "j"|"K"|"L"."m"|"n"|"o"|"p".("x"|"y"|"z")
    after : "jKL"."mnop".("x"|"y"|"z")
    

    After removing comments and making the printf changes:

    awk '
    BEGIN { FS=OFS="." }
          { for (i=1; i<=NF; i++)
                if ( $i !~ "^[(].*[)]$" )
                    $i="\"" gensub(/"|\|/,"","g",$i) "\"" 
            print
          }
    ' pipes.dat
    

    Which generates:

    ("a"|"b"|"c")."ABC".("e"|"f")."EF"
    "jKL"."mnop".("x"|"y"|"z")
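Since the question asked for sed, here is a sketch of a sed-only variant under the same assumptions. The attempt in the question most likely changes only the last block because `[[:graph:]]+` is greedy and also matches quotes, pipes, periods and parens, so a single match runs from the first `."` to the last `"|"` on the line. The sketch below instead removes one `"|"` at a time, only where it follows a field that starts with a double quote, and loops until nothing is left to change. The function name `strip_pipes_outside_parens` is hypothetical, and the pattern assumes that fields outside parens contain no periods or parens of their own.

```shell
#!/usr/bin/env bash
# Sketch only: join "A"|"B"|"C" into "ABC" in fields that start with a
# double quote, leaving fields that start with "(" untouched.
# strip_pipes_outside_parens is a hypothetical helper name.
strip_pipes_outside_parens() {
    sed -E \
        -e ':a' \
        -e 's/(^|\.)("[^()|.]*)"[|]"/\1\2/' \
        -e 'ta'
    # :a  defines a label
    # s   removes the first "|" that follows a quoted field start
    # ta  jumps back to :a as long as the substitution changed something
}

printf '%s\n' \
    '("a"|"b"|"c")."A"|"B"|"C".("e"|"f")."E"|"F"' \
    '"j"|"K"|"L"."m"|"n"|"o"|"p".("x"|"y"|"z")' |
    strip_pipes_outside_parens
```

On the two sample lines this prints the same result as the awk version above. The script is split across several `-e` options because some sed implementations do not accept `:a;...;ta` on a single line.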