Search code examples
regexperltemplate-meta-programming

Howto correct set the regex to substitute multiline variable placeholder with perl


I'm writing complex configuration files for a data transformation tool with Perl 5.20.

The configuration file has placeholders for several aspects at load time and runtime encapsulating some path code like

# Load time placeholder example
CONFIG: NAME ${/path/pos/*/test/*[123] == 'ABC' }

# Runtime placeholder example
COLUMN: CSV_NAME STRING DEFAULT :{./CSV_FIRST} 

For some reason it should work also on multi line expressions.

I've written scanner using Text::ParseWord, the standard separator \s+ and want to escape the placeholder expressions before splitting the data line(s) into single words by base64 encoded expressions not containing \s+. The expressions are also the keys for the subsequent data substitution.

The escaping is driven by a pattern match defined by (...see code below):

 my @pat = $line =~ /([^\\]\Q$pfx\E\{[^\Q$pfx\E\{\}]+\})/gs;

Which IMO defines a multi line pattern ${...} when I use $pfx = '$' for example, but masks (escapes) \${...} expressions.

Question

I'm struggling a while with the inner part of the pattern ... and get [^\Q$pfx\E\{\}]+ to work, but have the feeling that is not the correct one, because

  1. it includes only the symbols set not to use,
  2. but not the sequence of the outer part,
  3. to prevent nested expressions for example.

What is the right expression to do this?

Test routine

#!/usr/bin/env perl
use strict;
use warnings;
use MIME::Base64;

use feature qw(signatures);
no warnings 'once';
no warnings 'experimental';
no warnings 'experimental::signatures';

my $line =
'# test data
 This are:
 1. ${/multiline/used/*[3]
      = "12345"}
 2. ${/single/line/compile/time/pattern/*[3]}
 3. ${/single/line/runtime/pattern/x == 1234}
 4. ${/multi/line/runtime/pattern \
      defer/1 \
      defer/2 \
      defer/3
     }
 5. ${//PG.GRM/*[
        key eq "TEST.VAR"
       ]}
';

sub testSpacedPlaceHolder($pfx, $line) {
    my %match;
    
    my @pat = $line =~ /([^\\]\Q$pfx\E\{[^\Q$pfx\E\{\}]+\})/gs;
    my %seen = ();
    my @uniq = grep { ! $seen{$_} ++ } @pat;    
    for my $key (@uniq) {
        my $hkey=$pfx.encode_base64($key);
        $hkey =~ s/\n//g;
        my $var = substr($key, 3, -1);
        $match{$hkey}= [ $var, $key ];
        $line =~ s/\Q$key\E/$hkey/g;
    }
    # Test the output ------------------------------
    my $cnt = 0;
    print "\nRESULT:\n";
    for my $key (sort keys %match) {
        $cnt++;
        my ($var, $orig) = @ { $match{$key} };
        print "---- $cnt ----\n";
        print "ORG: $orig\n";
        print "VAR: $var\n";
        print "ESC: $key\n";

    }
    print "\nLINE:\n$line\n";
    return ($line, \%match);
}

testSpacedPlaceHolder('$', $line);

Result

/usr/bin/env perl "test-strings.pl"

RESULT:
---- 1 ----
ORG:  ${/multi/line/runtime/pattern \
      defer/1 \
      defer/2 \
      defer/3
     }
VAR: /multi/line/runtime/pattern \
      defer/1 \
      defer/2 \
      defer/3
     
ESC: $ICR7L211bHRpL2xpbmUvcnVudGltZS9wYXR0ZXJuIFwKICAgICAgZGVmZXIvMSBcCiAgICAgIGRlZmVyLzIgXAogICAgICBkZWZlci8zCiAgICAgfQ==
---- 2 ----
ORG:  ${/multiline/used/*[3]
      = "12345"}
VAR: /multiline/used/*[3]
      = "12345"
ESC: $ICR7L211bHRpbGluZS91c2VkLypbM10KICAgICAgPSAiMTIzNDUifQ==
---- 3 ----
ORG:  ${/single/line/compile/time/pattern/*[3]}
VAR: /single/line/compile/time/pattern/*[3]
ESC: $ICR7L3NpbmdsZS9saW5lL2NvbXBpbGUvdGltZS9wYXR0ZXJuLypbM119
---- 4 ----
ORG:  ${/single/line/runtime/pattern/x == 1234}
VAR: /single/line/runtime/pattern/x == 1234
ESC: $ICR7L3NpbmdsZS9saW5lL3J1bnRpbWUvcGF0dGVybi94ID09IDEyMzR9
---- 5 ----
ORG:  ${//PG.GRM/*[
        key eq "TEST.VAR"
       ]}
VAR: //PG.GRM/*[
        key eq "TEST.VAR"
       ]
ESC: $ICR7Ly9QRy5HUk0vKlsKICAgICAgICBrZXkgZXEgIlRFU1QuVkFSIgogICAgICAgXX0=

LINE:
# test data
 This are:
 1.$ICR7L211bHRpbGluZS91c2VkLypbM10KICAgICAgPSAiMTIzNDUifQ==
 2.$ICR7L3NpbmdsZS9saW5lL2NvbXBpbGUvdGltZS9wYXR0ZXJuLypbM119
 3.$ICR7L3NpbmdsZS9saW5lL3J1bnRpbWUvcGF0dGVybi94ID09IDEyMzR9
 4.$ICR7L211bHRpL2xpbmUvcnVudGltZS9wYXR0ZXJuIFwKICAgICAgZGVmZXIvMSBcCiAgICAgIGRlZmVyLzIgXAogICAgICBkZWZlci8zCiAgICAgfQ==
 5.$ICR7Ly9QRy5HUk0vKlsKICAgICAgICBrZXkgZXEgIlRFU1QuVkFSIgogICAgICAgXX0=

Edit

Assuming I have a script that defines some kind of configuration:

MAGIC: MAGIC.TYPE

CONTAINER: NAME BEGIN

DEFINE: VAR1 'USER.NAME'
DEFINE: VAR2 '65789'

INTERNAL.CONTAINER: INTERNAL.NAME BEGIN

    TAG1: 'ABCDEF'
    TAG2: ${/NAME/VAR1}

    # Unwanted nested variant 
    TAG3: ${/NAME/VAR1 ${/NAME/VAR2} }

    # Valid runtime interpolation variant
    TAG4: "${/NAME/VAR1}/:{NAME.KEY}"
    
    # Valid runtime path variant but ignored
    TAG5: ${/NAME/VAR1/*/:{TEST{KEY}}

END.INTERNAL.NAME
 
END.NAME 

I want to avoid the nested line

    # Nested variant 
    TAG3: ${/NAME/VAR1 ${/NAME/VAR2}}

for variable resolve reasons, but keep

    # Valid runtime path variant but ignored
    TAG5: ${/NAME/VAR2/*/:{TEST{KEY}}

because they are runtime driven.

My variant blocks TAG5 due to the simple sequence [\$\{\}]+.


Solution

  • Here is an example of how you can use a recursive regex to exclude nested versions of ${...}:

    use feature qw(say);
    use strict;
    use warnings;
    use Data::Dumper qw(Dumper);
    my $str = <<'END_STR';
    MAGIC: MAGIC.TYPE
    
    CONTAINER: NAME BEGIN
    
    DEFINE: VAR1 'USER.NAME'
    DEFINE: VAR2 '65789'
    
    INTERNAL.CONTAINER: INTERNAL.NAME BEGIN
    
        TAG1: 'ABCDEF'
        TAG2: ${/NAME/VAR1}
    
        # Unwanted nested variant
        TAG3: ${/NAME/VAR1 ${/NAME/VAR2} }
    
        # Valid runtime interpolation variant
        TAG4: "${/NAME/VAR1}/:{NAME.KEY}"
    
        # Valid runtime path variant but ignored
        TAG5: ${/NAME/VAR1/*/:{TEST{KEY}}}
    
    END.INTERNAL.NAME
    
    END.NAME
    END_STR
    
    my @matches;
    while ($str =~ /(?:^|(?<!\\))
                       (?<G3>(?<G1> \$ \{ (?:
                           (?>(?:[^{}\$\\] | (?:\\.) |
                             (?<G2> \{ (?: (?>[^{}]+) | (?&G2))* \} ))
                               | (?<G4>(?&G1))))* \}))/msxg) {
        next if defined $+{G4}; # skip nested matches
        push @matches, $+{G3};
    }
    print Dumper(\@matches);
    

    Output:

    $VAR1 = [
              '${/NAME/VAR1}',
              '${/NAME/VAR1}',
              '${/NAME/VAR1/*/:{TEST{KEY}}}'
            ];
    

    Notice that the results include TAG2, TAG4, and TAG5, but not TAG3.