Search code examples
regexgrepnestedpcre

replacing nested backreferences in regular expression


I've extracted a column from an Excel spreadsheet where some cells contain multiple rows, some don't. The multi-row cells are delimited by quote marks. I want to replace the returns in those multi-row cells with ", ". So my file looks something like this:

"Alpha
 Beta
 Gamma"
 123
 456
"Apple
 Banana
 Cherry"
 789
 this is a single-line cell

My goal is

 Alpha, Beta, Gamma
 123
 456
 Apple, Banana, Cherry
 789
 this is a single-line cell

I can write a pattern that captures those multi-line cells using a nested pattern, but I don't know how to "reach inside" to get at the backreference.

"(([^"]+)\n)+([^"]+)"

Solution

  • Try:

    $ sed '/^"/{:a; /"$/bb; N; ba; :b; s/\n//g}' file
    "Alpha Beta Gamma"
     123
     456
    "Apple Banana Cherry"
     789
     this is a single-line cell
    

    How it works

    • /^"/{...}

      For any lines that begins with ", the commands in curly braces will be performed.

    • :a

      This creates a label a.

    • /"$/bb

      If the current pattern space ends with a ", then branch to label b.

    • N

      Read a new line into the pattern space.

    • ba

      Branch back to label a.

    • :b

      This defines label b.

    • s/\n//g

      Remove any newline characters from the pattern space.