awk ignore the field delimiter pipe inside double quotes


I know this question has already been answered, but with a comma as the separator: How to make awk ignore the field delimiter inside double quotes?

My file is separated by pipes, though, and when I use a pipe in the regex it is treated as a regex operator rather than a literal character, so I do not get the proper output. I do not use awk extensively. My requirement is to add a single backslash before each pipe character that appears inside a value.

As the file is almost 5 GB, I thought I would select a particular column and escape the pipes in it.

INPUT:

"first | last | name" |" steve | white | black"| exp | 12
school |" home | school "| year | 2016
company |" private ltd "| joining | 2019

Expected Output:

"first \| last \| name" |" steve \| white \| black"| exp | 12
school |" home \| school "| year | 2016
company |" private ltd "| joining | 2019

I tried gawk with gsub but had no luck. Is there an alternative approach?

Also, how can I do this if I have to check multiple columns?


Solution

  • Assumptions:

    • a line can have more than one field with an embedded | character (each such field is wrapped in double quotes)
    • there may be more than one embedded | character in a single field
    • double quotes do not show up as embedded characters within other double quotes
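
Before running anything on a 5 GB file, it may be worth checking that the quoting assumption holds. A minimal sketch, assuming an odd number of double quotes on a line is a good enough signal of a problem (it will not catch stray quotes that happen to come in pairs); `check_quotes` is a hypothetical helper name, not part of the original answer:

```shell
# Hypothetical helper: report lines that contain an odd number of double quotes.
# Splitting on '"' yields (number of quotes + 1) fields, so an even NF
# means an odd quote count. Empty lines (NF == 0) are skipped.
check_quotes() {
    awk -F'"' 'NF > 0 && NF % 2 == 0 { print "line " NR ": unbalanced double quotes" }' "$1"
}

# Small made-up sample: the second line is missing its closing quote.
tmp=$(mktemp)
printf '%s\n' 'a |" b | c "| d' 'e |" f | g | h' > "$tmp"
check_quotes "$tmp"
# line 2: unbalanced double quotes
rm -f "$tmp"
```

If this prints nothing for the real file, every line has balanced quotes and both approaches in this answer should see the quoted fields where they expect them.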

    Setup:

    $ cat pipe.dat
    name |" steve | white "| exp | 12
    school |" home | school "| year | 2016
    company |" private ltd "| joining | 2019
    food |"pipe | one"|"pipe | two and | three"| 2022        # multiple double-quoted fields, multiple pipes between double quotes
    cars | camaro | chevy | 2033                             # no double quotes
    

    NOTE: the trailing # comments are not part of the file; they are added here to highlight the new cases

    One awk idea:

    awk '
    BEGIN { FS=OFS="\"" }              # define field delimiters as double quote
          { for (i=2;i<=NF;i+=2)       # double quoted data resides in the even numbered fields
                gsub(/\|/,"\\|",$i)    # escape all pipe characters in field #i
            print
          }
    ' pipe.dat
    

    This generates:

    name |" steve \| white "| exp | 12
    school |" home \| school "| year | 2016
    company |" private ltd "| joining | 2019
    food |"pipe \| one"|"pipe \| two and \| three"| 2022
    cars | camaro | chevy | 2033
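
To see why only the even-numbered fields are touched, it can help to print how a line is split when the double quote is the field separator. A small sketch using one of the sample lines; `show_fields` is a hypothetical helper name used only for illustration:

```shell
# Print every field of each input line with its index, splitting on '"'.
show_fields() {
    awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d: [%s]\n", i, $i }'
}

printf '%s\n' 'food |"pipe | one"|"pipe | two and | three"| 2022' | show_fields
# 1: [food |]
# 2: [pipe | one]
# 3: [|]
# 4: [pipe | two and | three]
# 5: [| 2022]
```

Fields 2 and 4 hold the text that was between double quotes, while the odd-numbered fields hold the real | delimiters and the unquoted data, which is why the loop starts at 2 and steps by 2.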
    

    Assuming there are no spaces between a | delimiter and an adjacent double quote ...

    One GNU awk idea (using the FPAT feature):

    awk -v FPAT='([^|]*)|("[^"]+")' '
    BEGIN { OFS="|" }
          { for (i=1;i<=NF;i++)
                gsub(/\|/,"\\|",$i)
            print
          }
    ' pipe.dat
    

    This also generates:

    name |" steve \| white "| exp | 12
    school |" home \| school "| year | 2016
    company |" private ltd "| joining | 2019
    food |"pipe \| one"|"pipe \| two and \| three"| 2022
    cars | camaro | chevy | 2033
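
The no-spaces assumption matters for the FPAT approach because the quoted alternative can only match when a field starts with a double quote. A rough sketch that counts the fields FPAT finds, assuming gawk is installed under that name; `count_fields` is a hypothetical helper and the two input lines are made up for the comparison:

```shell
# Hypothetical helper: print how many fields FPAT finds on each input line.
# FPAT is a GNU awk feature, so bail out if gawk is not available.
count_fields() {
    command -v gawk >/dev/null 2>&1 || { echo "gawk not found" >&2; return 1; }
    gawk -v FPAT='([^|]*)|("[^"]+")' '{ print NF }'
}

printf '%s\n' 'a |"b | c"| d'   | count_fields   # quote adjacent to the delimiter: 3
printf '%s\n' 'a | "b | c" | d' | count_fields   # space before the quote: 4
```

In the second line the quoted field is split at its embedded pipe, so the gsub would find nothing to escape there. For data laid out like that, the double-quote-delimited approach earlier in this answer is likely the safer choice.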