I want to extract tokens for a simple backtracking parser from a potentially long input text. My setup uses an integer cursor that holds the next position to read within the text, initially 0. I want to use substr to extract simple short tokens and regular expressions (perlre) for more complicated tokens. So the cursor position between subsequent regex searches might jump forward (after a successful token match) or backward (when backtracking).
My question is: how can I efficiently constrain the starting position of a Perl regex search so that it searches for a matching token only from that position?
For example, I want to get the decimal number tokens in the example text
my $text = 'long text with 2 numbers 3928 in it';
and the current cursor position is 25. As far as I know, I can do this either by generating a (probably inefficiently long) substring
my $tail = substr $text, 25;
printf "%s\n",
$tail =~ /^\d+/
? "match: $&"
: "miss";
or by setting the position used by the \G assertion with a (probably inefficient) extra pattern match (note that 25 would have to be a variable in the real tokenizer)
$text =~ /.{25}/gcs;  # /s (not /m) so that . also matches newlines
printf "%s\n",
$text =~ /\G\d+/
? "match: $&"
: "miss";
The latter alternative has the additional weakness that it is probably not thread-safe. That is not an issue for what I am doing right now, but I mention it for those who might use multithreading.
pos is an lvalue, so you can assign the starting position directly:
#!/usr/bin/perl
use strict;
use warnings;
use feature qw{ say };
my $text = 'long text with 2 numbers 3928 in it';
pos($text) = 25;
say for $text =~ /(\d+)/g;
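Note that the list-context match above returns every number found from position 25 onward, not only a number that starts exactly at 25. For the tokenizer described in the question, a minimal sketch might combine pos with the \G assertion and the /gc flags, so that the match is anchored at the cursor and no substring is copied. The helper name match_at and its return convention are assumptions made for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: try to match $re exactly at $cursor in $$text_ref.
# Returns (token, new_cursor) on success, or an empty list on failure.
sub match_at {
    my ($text_ref, $cursor, $re) = @_;
    pos($$text_ref) = $cursor;                 # set the start position, no copying
    # \G anchors the match at pos(); /c keeps pos() intact if the match fails
    return unless $$text_ref =~ /\G($re)/gc;
    return ($1, pos($$text_ref));
}

my $text = 'long text with 2 numbers 3928 in it';

# Forward step: a number starts exactly at position 25.
my ($token, $next) = match_at(\$text, 25, qr/\d+/);
print defined $token ? "match: $token, next cursor: $next\n" : "miss\n";

# Backtracking is just another call with an earlier cursor.
my ($earlier) = match_at(\$text, 15, qr/\d+/);   # the "2" at position 15
my @miss      = match_at(\$text, 16, qr/\d+/);   # a space at 16, so no match
```

With the sample text this prints "match: 3928, next cursor: 29". Since pos is stored per scalar, passing the text by reference keeps the position tied to the one string being tokenized.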