Search code examples
regexssisnotepad++ssis-2012ssis-2008

Regex to extract SSIS formulas in Notepad++


I have a long list of formulas in Microsoft SSIS that I have to translate into SQL queries. All the formulas of my list are basically made with the following rules:

  • PART_1 ? PART_2 : PART_3 means if PART_1 (test) is true then PART_2, if false then PART_3
  • && means AND (for different conditions in PART_1 (test)

My target is to extract in Notepad++ the PART_1, PART_2 and PART_3 from most external expressions, like the following examples:

Case I

Expression: A>0 ? B : C

  • PART_1 (test) => A>0
  • PART_2 (if true) => B
  • PART_3 (if false) => C

Case II

Expression: A>0 && (A>10 ? A : -A) ? B : C

  • PART_1 (test) => A>0 && (A>10 ? A : -A)
  • PART_2 (if true) => B
  • PART_3 (if false) => C

Case III

Expression: A>0 ? B : (C>14 ? A*14 : -4)

  • PART_1 (test) => A>0
  • PART_2 (if true) => B
  • PART_3 (if false) => (C>14 ? A*14 : -4)

Case IV

Expression: A>0 ? (A>0 ? 2-B : C) : (C>14 ? A*14 : -4)

  • PART_1 (test) => A>0
  • PART_2 (if true) => (A>0 ? 2-B : C)
  • PART_3 (if false) => (C>14 ? A*14 : -4)

What I have done

In Notepad++ I have made the following REGEX to extract the three parts

PART_1 (test)

.*?\?

PART_2 (if true)

\?.*?:

PART_3 (if false)

:.*

without flag the ". matches newline" option.

The main problem is that they match (of course) everything and not only the most external expressions so I can't use them with nested formulas. How can I fix this problem?


Solution

  • Balanced Parentheses

    The basic problem comes down to "balanced parentheses", which can be matched with

    \((?>[^()]|(?R))*\)
    

    See here:

    Regular expression to match balanced parentheses

    3 Capture Groups

    This expression captures an whole expression into the 3 capture groups

    ^((?:\((?>[^()]|(?R))*\)|[^?\(])+?)\s*\?\s*((?:\((?>[^()]|(?R))*\)|[^:\(\s])+?)\s*:\s*((?:\((?>[^()]|(?R))*\)|[^\$\s])+)$
    

    Yes, it can be made simpler, but we use this to separate the components into the 3 expressions for the 3 components

    "condition" (part 1)

    ^(?:(?:\((?>[^()]|(?R))*\)|[^?\(])+?)(?=\s*\?)
    

    Assumes starts at the beginning of a line. (?=\s*\?) is a positive-lookahead.

    "value_if_true" (part 2)

    ^((?:\((?>[^()]|(?R))*\)|[^?\(])+?)\s*\?\s*\K((?:\((?>[^()]|(?R))*\)|[^:\(\s])+?)(?=\s*:)
    

    Basically, this one is made up from the "condition" regex then a \K which throws it away (so it's not captured) and then what we want to match followed by a lookahead. We use the \K here because the regex engine (boost) Notepad++ I don't think supports infinite repetition in "positive-lookbehind".

    "value_if_false" (part 3)

    :\s*\K((?:\((?>[^()]|(?R))*\)|[^\$\s])+)$
    

    Again using \K here to throw that bit away.