Search code examples
awksed

remove pipe and double quote in a string


im having some challenges in 1 CSV files im trying to decode. as a delimiter i have pipe to identified column numbers and double quote to identify big string column.

I need to do prep processing script to remove this unwanted chars and for the sample2 add the missing double quote.

sample 1|Pending|"XXX LPA. LTD. - GridBird TEST (Monthly) (Collage Maker | Photo Editor)"|749|...
sample 2|Pending|"Take Five Entertainment" Limited Liability Partnership - Pack with cats "Halloween" (Cat Simulator: My Pets)|...

desire output:

sample 1|Pending|"XXX LPA. LTD. - GridBird TEST (Monthly) (Collage Maker Photo Editor)"|749|...
sample 2|Pending|"Take Five Entertainment Limited Liability Partnership - Pack with cats Halloween (Cat Simulator: My Pets)"|749|...

im trying to regex to identify the first sample and replace with sed but it seems im missing some sed behavior how it works as first i need to identify the string pattern and then replace only that chars in this case any | or "

^.*\|\".*(\||\").*(\"\|\d|\|\d)

on this regex I'm identifying the string between |" and ("|digits or just |digits).

Next step is what I'm confused how to do with Sed.

Best Regards

Solution for this based on new samples i got. Thank you @Ed Morton

awk '
    {
        {
                gsub(/\{/, " ");
                gsub(/\}/, " ");

        # First split the record on `"`s and replace every `|` (plus
        # any surrounding spaces) inside a quoted field with a blank.
        FS=OFS="\""; $0=$0
        for ( i=2; i<=NF; i+=2 ) {
            #print $i
            gsub(/[[:space:]][|][[:space:]]*/, " ", $i)
        }

        FS=OFS="\""; $0=$0
        for ( i=2; i<=NF; i+=2 ) {
            #print $i
            if ($i ~ /[#\-\+\?A-Za-z][|][#\-\+\?A-Za-z]*/)
            {
            #print $i
            sub(/[|]{1}/, " ", $i)  
            }
        }


        # Now split on `|`s, remove `"`s within fields and
        # add `"`s around every field that contained `"`s
        FS=OFS="|"; $0=$0;
        for ( i=1; i<=NF; i++ ) {
            gsub(/"/, "", $i);
        }

        print
    }

Solution

  • Here's one way to get the output you provided from the sample input you provided using any POSIX awk:

    $ awk '
        {
            # First split the record on `"`s and replace every `|` (plus
            # any surrounding spaces) inside a quoted field with a blank.
            FS=OFS="\""; $0=$0
            for ( i=2; i<=NF; i+=2 ) {
                gsub(/[[:space:]]*[|][[:space:]]*/, " ", $i)
            }
    
            # Now split on `|`s, remove `"`s within fields and
            # add `"`s around every field that contained `"`s
            FS=OFS="|"; $0=$0;
            for ( i=1; i<=NF; i++ ) {
                if ( gsub(/"/, "", $i) ) {
                    $i = "\"" $i "\""
                }
            }
    
            print
        }
    ' file
    sample 1|Pending|"XXX LPA. LTD. - GridBird TEST (Monthly) (Collage Maker Photo Editor)"|749|...
    sample 2|Pending|"Take Five Entertainment Limited Liability Partnership - Pack with cats Halloween (Cat Simulator: My Pets)"|...