
How can I match and modify C and C++ comments with Perl?


I have the task of (trying to) doing a search and replace within a large codebase for a word suffix, only when it occurs within comments. All of the comments are of the /* or // type, but they are guaranteed to include most of the edge cases imaginable.

So I want to change this:

/* blah blah something__suffix blah */

to this:

/* blah blah something blah */

but I also want to change this:

// blah blah something__suffix blah 

to this:

// blah blah something blah 

And this:

/*
 * blah blah something__suffix blah 
 */

to this:

/*
 * blah blah something blah 
 */

And this:

/** 

// blah blah something__suffix blah 

*/

to this:

/** 

// blah blah something blah 

*/

ad nauseam (literally).

Initially I felt that this was a parser task, so I installed Coccinelle. It could indeed parse my comments, but it got stuck on my preprocessor macros, and the workarounds seemed complex for someone who just has this as a one-off task. So now I'm considering regex.

I haven't found much advice about doing really robust search and replace within C and C++ comments with regex (besides "you need a parser"), but I did notice that the Perl FAQ has a pretty well road-tested script for removing comments in both of these styles, as follows:

$/ = undef;
$_ = <>;

s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;

print;

My question: how to adapt this script so that instead of stripping the comment, the text that has been identified as a comment can then be searched for the suffix and the suffix removed, leaving the rest of the comment intact?


Solution

  • You need to do it in two steps because a single comment can contain more than one occurrence, for example

    /* foo__suffix bar__suffix */
    

    First, match the whole comment; then remove every __suffix within it.

    s{
       \G
       (?:(?!/[*/]).)*
       \K
       (   /[*] (?:(?![*]/).)* [*]/
       |   //   [^\n]*
       )
    }{
       my $comment = $1;
       $comment =~ s/(?<=\w)__suffix//g;
       $comment
    }xesg;
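
The substitution above can be tried end to end with a sketch like the following. It wraps the pattern in a subroutine and applies it with the g flag, so that \G can carry on from the end of one comment to the next and every comment in the input is processed. The subroutine name strip_suffix_in_comments and the sample input are made up for illustration; they are not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the source text with __suffix
# removed from words, but only inside /* ... */ and // comments.
sub strip_suffix_in_comments {
    my ($src) = @_;
    $src =~ s{
        \G                              # resume where the previous match ended
        (?:(?!/[*/]).)*                 # skip text up to the next comment opener
        \K                              # leave the skipped text out of the match
        (   /[*] (?:(?![*]/).)* [*]/    # block comment
        |   //   [^\n]*                 # line comment
        )
    }{
        my $comment = $1;               # copy: the inner s/// resets the capture
        $comment =~ s/(?<=\w)__suffix//g;
        $comment
    }xesg;
    return $src;
}

# Illustrative input: one identifier in code, three in comments.
my $sample = <<'END';
int x__suffix; /* foo__suffix bar__suffix */
// baz__suffix
END

print strip_suffix_in_comments($sample);
# prints:
# int x__suffix; /* foo bar */
# // baz
```

To use it as a filter in the style of the FAQ script, one could slurp the input (local $/; my $src = <>;) and print the return value instead of the sample.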
    

    Notes:

    • (?:(?!STRING).) is to (?:STRING) as [^CHAR] is to CHAR: it matches one character, provided STRING does not start at that position.

    • My solution will mess up if you have // or /* in a string literal.

    • If you're ok with removing instances of __suffix that aren't preceded by an identifier, you can remove the (?<=\w).

    • If you're using 5.14 or higher, the /r modifier (which returns the modified copy instead of changing its operand) lets you simplify

      s{...}{
         my $comment = $1;
         $comment =~ s/(?<=\w)__suffix//g;
         $comment
      }xesg;
      

      to

      s{...}{
         $1 =~ s/(?<=\w)__suffix//rg
      }xesg;
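
If // or /* inside string literals is a concern, one possible workaround is to borrow the idea from the Perl FAQ script quoted in the question: match string and character literals as their own alternative and pass them through untouched. The following is only a sketch under that assumption, not the answer's method. The name strip_suffix_outside_strings is hypothetical, and the sketch does not try to handle backslash-continued // comments, C++ raw string literals, or digit separators such as 1'000.

```perl
use strict;
use warnings;

# Hypothetical variant: comments are captured in group 1, everything else
# (including string and character literals) in group 2 and returned as-is.
sub strip_suffix_outside_strings {
    my ($src) = @_;
    $src =~ s{
        (   /[*] .*? [*]/               # block comment
        |   // [^\n]*                   # line comment
        )
      |
        (   " (?: \\. | [^"\\] )* "     # double-quoted string literal
        |   ' (?: \\. | [^'\\] )* '     # character literal
        |   [^/"']+                     # run of ordinary code
        |   .                           # lone slash or unterminated quote
        )
    }{
        my ($comment, $other) = ($1, $2);
        $comment =~ s/(?<=\w)__suffix//g if defined $comment;
        defined $comment ? $comment : $other
    }xesg;
    return $src;
}

# Illustrative input: a comment opener inside a string, then a real comment.
my $sample = <<'END';
puts("// not__suffix a comment"); // real__suffix comment
END

print strip_suffix_outside_strings($sample);
# prints:
# puts("// not__suffix a comment"); // real comment
```

Because every character is consumed by one alternative or the other, no \G anchor is needed here; the trade-off is that the replacement code runs for each chunk of ordinary code as well as for each comment.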