Matching double line breaks using Regex


I am writing a Regex to extract various pieces of information from an EDIFACT UN code list. As there are tens of thousands of codes, I do not want to type them all in, so I decided to use a Regex to parse the text file and extract the parts I need. The text file is structured in a way that makes those parts easy to identify.

I created the following Regex and tested it in Regex Hero, but I cannot get the codeComment group to match everything up to a double line break. I tried the character class [^\n\n], but the match still stops at the first line break.

Note: I have selected the Multiline option on Regex Hero.

(?<element>\d+)\s\s(?<elementName>.*)\[[BCI]\]\s+Desc: (?<desc>[^\n]*\s*[^\n]*)
^\s*Repr: (?<type>a(?:n)?)..(?<length>\d+)
^\s*(?<code>\d+)\s*(?<codeName>[^\n]*)
^\s{14}(?<codeComment>[^\n]*)

This is the example text I am using to match.

----------------------------------------------------------------------

  • 1073 Document line action code [B]

    Desc: Code indicating an action associated with a line of a
        document.

    Repr: an..3

    1 Included in document/transaction
        The document line is included in the
        document/transaction.
        should capture this as well.

    2 Excluded from document/transaction
        The document line is excluded from the
        document/transaction.

What I want is for codeComment to contain the following:

The document line is included in the
          document/transaction.
          should capture this as well.

but it is only extracting the first line:

The document line is included in the

Solution

  • A character class matches exactly one character, and listing a character twice has no effect, so [^\n\n] is the same as [^\n]. A character class therefore cannot be used to check for consecutive line breaks. You can use a negative lookahead assertion instead:

    ^\s{14}(?<codeComment>(?s)(?:(?!\n\n).)*)

    (?s) switches on singleline mode, which allows the dot to match newlines.

    (?!\n\n) asserts that the current position is not followed by two consecutive line breaks. Because it is checked before every character the dot consumes, the group stops just before the first blank line.
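The same lookahead technique can be tried outside Regex Hero. The following is a minimal Python sketch, not the asker's full pattern: the names SAMPLE, CODE_ENTRY and parse_codes are hypothetical, the indentation is matched with [ \t]+ rather than \s{14} because the sample above does not show 14 leading spaces, and Python spells named groups (?P<name>...) instead of (?<name>...). It assumes the file uses plain \n line endings.

```python
import re

# Two code entries copied from the sample text in the question.
SAMPLE = (
    "    1 Included in document/transaction\n"
    "        The document line is included in the\n"
    "        document/transaction.\n"
    "        should capture this as well.\n"
    "\n"
    "    2 Excluded from document/transaction\n"
    "        The document line is excluded from the\n"
    "        document/transaction.\n"
)

# A character class is a set: "[^\n\n]" is exactly "[^\n]".
assert re.findall(r"[^\n\n]+", "a\n\nb") == re.findall(r"[^\n]+", "a\n\nb") == ["a", "b"]

# (?!\n\n) is tested before each character is consumed; (?s:.) lets the
# dot match newlines inside that group only.
CODE_ENTRY = re.compile(
    r"^\s*(?P<code>\d+)[ \t]+(?P<codeName>[^\n]*)\n"
    r"[ \t]+(?P<codeComment>(?:(?!\n\n)(?s:.))*)",
    re.MULTILINE,
)


def parse_codes(text):
    """Return (code, name, comment) tuples, one per code entry."""
    entries = []
    for m in CODE_ENTRY.finditer(text):
        # Join the wrapped comment lines into a single line.
        comment = " ".join(
            line.strip() for line in m.group("codeComment").splitlines()
        )
        entries.append((m.group("code"), m.group("codeName").strip(), comment))
    return entries


if __name__ == "__main__":
    for entry in parse_codes(SAMPLE):
        print(entry)
```

For the first entry, codeComment now spans all three comment lines and ends just before the blank line, so the joined comment reads "The document line is included in the document/transaction. should capture this as well."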