regex performance perl tokenize

Set a Perl regular expression search to start at a given position in a long string


I want to extract tokens for a simple backtracking parser from a potentially long input text. My setup uses an integer cursor that holds the next-to-read position within the text, initially 0. I want to use substr to extract simple short tokens and regular expressions (perlre) for more complicated ones. The cursor position between subsequent regex searches might therefore jump forward (after a successful token match) or backward (when backtracking).

My question is: how can I efficiently constrain the starting position of a Perl regex search so that it looks for a matching token only from that position?

For example, I want to get the decimal number tokens in the example text

my $text = 'long text with 2 numbers 3928 in it';

where the current cursor position is 25. The two approaches I know of are either generating a (probably inefficiently long) substring

my $tail = substr $text, 25;
printf "%s\n",
    $tail =~ /^\d+/
    ? "match: $&"
    : "miss";

or moving the position that the \G assertion anchors to with a (probably inefficient) extra pattern match (note that 25 would have to be a variable in the real tokenizer)

$text =~ /.{25}/gcs;   # /s lets . match newlines so the skip works on multi-line text
printf "%s\n",
    $text =~ /\G\d+/
    ? "match: $&"
    : "miss";

The latter alternative has the additional weakness that it is probably not thread-safe. That is not an issue for what I am doing right now, but I mention it for those who might use multi-threading.


Solution

  • pos is an lvalue and can be assigned to: assigning to pos($text) sets the offset at which the next /g match on $text starts.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw{ say };
    
    my $text = 'long text with 2 numbers 3928 in it';
    
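    # Start the next /g match on $text at offset 25.
    pos($text) = 25;
    
    # In list context, /g returns every match from pos($text) onward.
    say for $text =~ /(\d+)/g;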
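
For the tokenizer described in the question, the match usually has to begin exactly at the cursor rather than anywhere after it. A minimal sketch of that, assuming a hypothetical helper named match_at (not part of the original answer): it assigns to pos and then anchors the pattern with \G in scalar context, using /c so a failed match does not reset pos.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name and interface are illustrative assumptions):
# try to match $re exactly at offset $cursor in the string behind $text_ref.
# Returns (token, new_cursor) on success, or an empty list on failure.
sub match_at {
    my ($text_ref, $cursor, $re) = @_;
    pos($$text_ref) = $cursor;          # tell the regex engine where \G anchors
    if ($$text_ref =~ /\G($re)/gc) {    # scalar context; /c keeps pos on failure
        return ($1, pos($$text_ref));   # pos now points just past the token
    }
    return;
}

my $text = 'long text with 2 numbers 3928 in it';

# Forward step: read a number at offset 25.
my ($tok, $next) = match_at(\$text, 25, qr/\d+/);
print defined $tok ? "match: $tok, next cursor: $next\n" : "miss\n";

# Backtracking: move the cursor back to offset 15 and try again.
($tok, $next) = match_at(\$text, 15, qr/\d+/);
print defined $tok ? "match: $tok, next cursor: $next\n" : "miss\n";
```

Passing the text by reference avoids copying the long string on every call, and no substring is created. Because the cursor is passed in explicitly on each call, jumping backward works the same way as stepping forward.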