Search code examples
regexperlconditional-statementsmatchgreedy

How to do conditional greedy match in Perl?


I want Perl to parse a code text and identify certain stuffs, example code:

use strict;
use warnings;

$/ = undef;

while (<DATA>) {
  s/(\w+)(\s*<=.*?;)/$1_yes$2/gs;
  print;
}

__DATA__
always @(posedge clk or negedge rst_n)
if(!rst_n)begin
        d1 <= 0; //perl_comment_4
        //perl_comment_5
        d2 <= 1  //perl_comment_6
                 + 2;
        end
else if( d3 <= d4 && ( d5 <= 3 ) ) begin
        d6 <= d7 +
                 (d8 <= d9 ? 1 : 0);
        //perl_comment_7
        d10 <= d11 <=
                      d12
                        + d13
                            <= d14 ? 1 : 0;
        end

Match target is something that meets all of the following:

(1) It begins with the combination word\s*<=. Here \s* maybe 0 or more spaces, newlines, tabs.

(2) The aforementioned "combination" should be out of any pair of ( and ).

(3) If multiple "combinations" appear consecutively, then take the first one as the beginning. (Something like "greedy" matching at the left boundary)

(4) it ends with the first ; after the "combination" mentioned in (1).

There may be word\s*<= and ; in code comments (there may be anything in comments); this makes things more complicated. To make life easier, I already pre-processed the text, scanning for comments and replacing them with stuff like //perl_comment_6. (This solution seems rather cumbersome and stupid. Any smarter, more elegant solutions?)

What I wanna do:

For all matched word\s*<=, replace word with word_yes. For the example code, d1, d2, d6 and d10 should be replaced by d1_yes, d2_yes, d6_yes and d10_yes, respectively, and all other parts of the text should remain unchanged.

In my current code I use s/(\w+)(\s*<=.*?;)/$1_yes$2/gs;, which correctly recognizes d1, d2 and d10, but fails to recognize d6 and mistakenly recognizes d3.

Any suggestions? Thanks in advance~


Solution

  • This is a lot more complicated that you might imagine, and it is impossible to do properly without writing a parser for the language you are trying to process. However, you may be in luck if your sample is a consistently limited subset of the language

    The best way I can see to do this is to use split to separate out all the subsections of the string that are in parentheses from the "top level" sections where the replacements are to be done. Then the changes can be made to the relevant parts and the split sections joined back together

    Even this relies on the code having properly balanced parentheses, and an odd open or closing parenthesis that appears in, say, a string or a comment will throw the process out. The regex used in the split has to be recursive so that nested parentheses can be matched, and making it a capturing regex makes split returns all of the parts of the string instead of just the sections between the matches

    This code will do as you ask, but beware that, as I described, it is extremely fragile

    use strict;
    use warnings;
    
    my $data = do {
        local $/;
        <DATA>;
    };
    
    my @split = split / ( \( (?> [^()] | (?1) )* \) ) /x, $data;
    
    for ( @split ) {
        next if /[()]/;
        s/ ^ \s* \w+ \K (?= \s* <= ) /_yes/xgm;
    }
    
    print join '', @split;
    
    
    __DATA__
    always @(posedge clk or negedge rst_n)
    if(!rst_n)begin
            d1 <= 0; //perl_comment_4
            //perl_comment_5
            d2 <= 1  //perl_comment_6
                     + 2;
            end
    else if( d3 <= d4 && ( d5 <= 3 ) ) begin
            d6 <= d7 +
                     (d8 <= d9 ? 1 : 0);
            //perl_comment_7
            d10 <= d11 <=
                          d12
                            + d13
                                <= d14 ? 1 : 0;
            end
    

    output

    always @(posedge clk or negedge rst_n)
    if(!rst_n)begin
            d1_yes <= 0; //perl_comment_4
            //perl_comment_5
            d2_yes <= 1  //perl_comment_6
                     + 2;
            end
    else if( d3 <= d4 && ( d5 <= 3 ) ) begin
            d6_yes <= d7 +
                     (d8 <= d9 ? 1 : 0);
            //perl_comment_7
            d10_yes <= d11 <=
                          d12
                            + d13
                                <= d14 ? 1 : 0;
            end