Tags: python, regex, parsing, python-re

Regex to detect optional comment blocks followed by conditional blocks


I'm trying to write a script that detects and removes an #ifdef BUILD_FLAG ... #endif block from files, together with an optional comment block if one occurs right before it. For example, all of the following should be removed:

//this is a comment block
/**
*and a nested comment block
*/
//this is another comment block
#ifdef BUILD_FLAG
...
#endif

I'm trying to do it with this code

import re

block_comment_pattern = r'/\*[\s\S]*?\*/|\/\/.*?$|\/\/.*?(\n|$)'
conditional_block_pattern = rf'#ifn?def\s+{re.escape(build_flag)}[\s\S]*?\s*#endif'
pattern = rf'({block_comment_pattern})?{conditional_block_pattern}'
MATCHER = re.compile(pattern, re.M)

However, it only matches the last part of the comment block together with the conditional block:

//this is another comment block
#ifdef BUILD_FLAG
...
#endif

When I tested the comment block pattern separately, it captured the whole comment block, but it no longer does so when combined with the conditional pattern. What is a better way (or pattern) to capture the example above?

Here is a demo of what I want to capture: https://regex101.com/r/LLJV5i/1. In this demo, wherever a whole comment block occurs right before the conditional block, both of them should be captured (the comment block is optional; the conditional block is required).

Note:

  • Block comments are optional and in C++ style (a '//' comment may be preceded by code on the same line, and '*/' may be on a line by itself)
  • There might be some spaces (indentation) before the block comments, and blank lines between the block comment and the conditional block (example 3)
  • The block comments in my demo are examples of what I want to capture.

Examples of block comments that need to be captured:

Example 1:

//-------------------------------------------------------------------
// this whole comment block and the conditional block below should be captured
//-------------------------------------------------------------------
/**
 * @brief Some comment 
 *        
 *
 * @return 
 */
//-------------------------------------------------------------------
#ifdef BUILD_FLAG
...
#endif

Example 2:

//---------------------------------------------------------
// this whole comment block and the conditional block below should be captured
// --------------------------------------------------------
#ifdef BUILD_FLAG
...
#endif

Example 3:


    //----------------------------------------------------------------------
    /**
     * @comment: this whole block comment should be captured.
     * @{
     */

#ifdef BUILD_FLAG
...
#endif

Solution

  • You could replace matches of the following regular expression (with g and m flags set) with empty strings:

    (?:^ *(?:/\*\*[^/\n]*\r?\n(?:[^/\n]*\r?\n)*[^/\n]*\*/ *|//.*)\r?\n)+(?: *\r?\n)*#ifdef BUILD_FLAG\r?\n[\s\S]*?^#endif\r?\n

    Demo

    The expression can be broken down as follows (you can also hover the cursor over each part of the expression at the link to see an explanation of its function).

    (?:                    # begin non-capture group
      ^                    # match beginning of line
      [ ]*                 # match >= 0 spaces
      (?:                  # begin non-capture group 
        /\*\*[^/\n]*\r?\n  # match '/**' followed by >= 0 chars other than '/' and
                           # newlines followed by the line terminator
        (?:                # begin a non-capture group
          [^/\n]*\r?\n     # match >= 0 chars other than '/' and newline then line term
        )                  # end non-capture group
        *                  # match the preceding non-capture group >= 0 times
        [^/\n]*            # match >= 0 chars other than '/' and newline
        \*/[ ]*            # match '*/' followed by >= 0 spaces
        |                  # or
        //.*               # match '//' followed by >= 0 chars other than line terms
      )                    # end non-capture group
      \r?\n                # match line terminator 
    )                      # end non-capture group
    +                      # match preceding non-capture group >= 1 times
    (?:[ ]*\r?\n)*         # match >= 0 lines containing zero or more spaces
    #ifdef BUILD_FLAG\r?\n # match literal line
    [\s\S]*                # match >= 0 chars of any kind (including newlines)
    ?                      # match as few preceding tokens as possible
    ^#endif\r?\n           # match '#endif' at the beginning of a line, then line term
    

    Note: I've represented spaces as character classes containing a space ([ ]) to make them visible.
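
Since the question is about Python, here is a minimal sketch of how the expression might be applied there. It assumes the built-in re module, where re.sub already replaces every match (so there is no separate g flag) and re.MULTILINE plays the role of the m flag. The function name strip_flag_blocks and its build_flag parameter are hypothetical names chosen for illustration, and the blank-line part of the pattern is written as \r?\n rather than as a character class.

```python
import re


def strip_flag_blocks(source: str, build_flag: str = "BUILD_FLAG") -> str:
    """Remove '#ifdef <build_flag> ... #endif' blocks plus the comment lines above them.

    Hypothetical helper, not part of any library; it only wraps re.sub.
    """
    pattern = re.compile(
        # One or more comment lines: either a '/** ... */' block or a '//' line,
        # each optionally indented with spaces.
        r"(?:^ *(?:/\*\*[^/\n]*\r?\n(?:[^/\n]*\r?\n)*[^/\n]*\*/ *|//.*)\r?\n)+"
        # Zero or more blank (or space-only) lines between comment and directive.
        r"(?: *\r?\n)*"
        # The conditional block itself; the flag name is escaped before interpolation.
        rf"#ifdef {re.escape(build_flag)}\r?\n"
        # Shortest possible body, ending at an '#endif' that starts a line.
        r"[\s\S]*?^#endif\r?\n",
        re.MULTILINE,  # makes '^' match at the start of every line
    )
    return pattern.sub("", source)


if __name__ == "__main__":
    sample = (
        "int a;\n"
        "    //------------------------------\n"
        "    /**\n"
        "     * @comment: this whole block comment should be captured.\n"
        "     * @{\n"
        "     */\n"
        "\n"
        "#ifdef BUILD_FLAG\n"
        "...\n"
        "#endif\n"
        "int b;\n"
    )
    print(strip_flag_blocks(sample))
```

As written, the expression requires at least one comment line before the #ifdef, so a conditional block with no preceding comment is left in place. It also assumes that the #endif line ends with a line terminator and that the body of a /** ... */ comment contains no '/' characters.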