Tags: regex, perl, sed, command-line-interface, regexp-replace

command line: replace newline followed by character


I want to remove newlines (\n) in a file, but only when the next line starts with optional spaces followed by a less-than character (\s*<).

Example Text:

FIRST LINE ('<foo>
  <bar>
<baz>')

ANOTHER LINE 'lorem ipsim', '<dolor>
        <and>
            <p>again</p>
        </and>
</dolor>'

I need to do that on the command line using sed, perl, tr, ...

I tried several commands, but none has worked so far. Basically it is: sed -i -e 's|\n+\s*\<|<|gm' filename

It seems like sed does not look further than the newline.

https://regex101.com/r/VkRO9o/3

Is there any command that can do that?

EDIT: Expected Output:

FIRST LINE ('<foo>  <bar><baz>')

ANOTHER LINE 'lorem ipsim', '<dolor><and><p>again</p></and></dolor>'

It's fine if the spaces aren't replaced.


Solution

  • You may use perl for this:

    perl -0777 -pe "s/\h*\R+\h*([<'])/\$1/g" file
    
    FIRST LINE ('<foo><bar><baz>')
    
    ANOTHER LINE 'lorem ipsim', '<dolor><and><p>again</p></and></dolor>'
    


    Details:

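    • -0777: Enable slurp mode, so the whole file is read as a single record and the regex can match across newlines
    • \h*\R+\h*([<']): Match 0+ horizontal whitespace characters, followed by 1+ line breaks, followed by 0+ horizontal whitespace characters and then < or ', which is captured in group #1. The match is replaced with $1, i.e. the < or ' that was captured. The command writes \$1 because the script is in double quotes: the backslash stops the shell from expanding $1 before perl sees it.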