Perl regex without variable length lookbehind?


I'm trying to hyperlink 400 or so keywords in a 50,000 word markdown document.

This is one of several steps in a Perl "build chain", so it would be ideal to do the hyperlinking in Perl as well.

I have a separate file containing all the keywords and mapping each to the markdown fragment it should be replaced with, like this:

keyword::(keyword)[#heading-to-jump-to]

The above example implies that wherever "keyword" occurs in the source markdown document, it should be replaced by the markdown fragment "(keyword)[#heading-to-jump-to]".

Ignoring keywords that occur as substrings of other keywords, plural/singular forms, and ambiguous keywords, it's reasonably straightforward. But naturally, there are two additional constraints.

I need to match only instances of keyword which are:

  • Not on a line beginning with #
  • Not directly below The Heading To Jump To

The plain English meaning of these is: don't match keywords in any headings, and don't replace keywords that are under the heading they would link to.

My Perl script reads the $keyword::$link pairs and then, pair by pair, substitutes them into a regex, and then searches/replaces the document with that regex.

I've written a regex that does the matching (for the cases I've manually tested so far) using RegexBuddy's JGSoft regex implementation. It looks like this:

Frog::(Frog)[#the-frog]
-->    
([Ff]rog'?s?'?)(?=[\.!\?,;: ])(?<!#+ [\w ]*[Ff]rogs?)(?<!#+ the-frog)(?<!#+ the-frog[^#]*)

The problem (or, perhaps, a problem) with this is that it uses variable-length lookbehinds, which are not supported by Perl. So I can't even test this regex on the full document to see if it really works.

I've read a bunch of other posts on how to work around variable-length lookbehinds, but I can't seem to get it right for my particular case. Can any of the resident regex wizards help out with a neater regex that will execute in Perl?


Solution

  • As I see it, your program will have three states:

    1. In a headline.
    2. In a paragraph directly after a headline.
    3. In other paragraphs.

    Because this is roughly a regular language, it can be parsed with regexes. But why would we want to do that, considering it would take 400 passes over the text (one per keyword)?
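
To illustrate the alternative to 400 separate passes, here is a minimal sketch that folds all keywords into one alternation and substitutes in a single pass. The two-entry `%substitutions` table and the helper name `build_kw_regex` are made up for the example; the real table would come from the `keyword::replacement` file. Keywords are sorted longest first so that a longer keyword is tried before a shorter one it contains.

```perl
use strict;
use warnings;

# Hypothetical keyword => link table (keys lowercased); the real one would be
# read from the "keyword::replacement" file described in the question.
my %substitutions = (
    'frog'      => '#the-frog',
    'tree frog' => '#tree-frogs',
);

# Build a single case-insensitive alternation, longest keyword first,
# so that "tree frog" is tried before "frog".
sub build_kw_regex {
    my @keywords = @_;
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) || $a cmp $b } @keywords;
    return qr/\b(?:$alternation)\b/i;
}

my $kw_regex = build_kw_regex(keys %substitutions);

my $text = "A tree frog is still a Frog.";
(my $linked = $text) =~ s{($kw_regex)}{
    my $keyword = $1;
    my $link    = $substitutions{lc $keyword};
    "($keyword)[$link]";
}eg;

print "$linked\n";
```

The lookup uses `lc` so that "Frog" and "frog" share one table entry while the original capitalisation is kept in the output.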

    It might really be easier to split the file into an array of paragraphs. When we hit a headline, we work out which links point to it. Then, in the next paragraph, we substitute all keywords except the forbidden ones. E.g.:

    my %substitutions = ...;
    my $kw_regex = ...;
    my %forbidden; # holds state
    
    local $/ = ""; # paragraph mode
    while (<>) {
      if (/^#/) {
        # it's a headline
        @forbidden{ slugify($_) } = ();  # extract forbidden link(s)
      } else {
        # a paragraph
        s{($kw_regex)}{
          my $keyword = $1;
          my $link = $substitutions{lc $keyword};
          exists $forbidden{$link} ? $keyword : "($keyword)[$link]";
        }eg;
        %forbidden = (); # forbidden links only in 1st paragraph after headline
      }
      print;
    }
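
The snippet above leaves `%substitutions`, `$kw_regex` and `slugify` open. One way they could be filled in is sketched below, under stated assumptions: `slugify` is a guess at how headings map to anchors (lowercase, non-alphanumerics collapsed to hyphens), `link_keywords` is a hypothetical wrapper that works on a string instead of `<>` so it can be tried in isolation, and the keyword regex is a plain alternation that ignores plurals and possessives.

```perl
use strict;
use warnings;

# Hypothetical slug rule: "# The Frog" -> "#the-frog". The real rule depends
# on whatever generates the heading anchors in the build chain.
sub slugify {
    my ($headline) = @_;
    (my $slug = lc $headline) =~ s/^#+\s*//;   # drop the leading "#" marks
    $slug =~ s/\s+$//;                         # drop trailing whitespace
    $slug =~ s/[^a-z0-9]+/-/g;                 # everything else becomes "-"
    return "#$slug";
}

# Paragraph-wise linking; $substitutions maps lowercased keyword => link.
sub link_keywords {
    my ($text, $substitutions) = @_;
    my $alternation = join '|', map { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %$substitutions;
    my $kw_regex = qr/\b(?:$alternation)\b/i;

    my %forbidden;   # links that must not be produced in the next paragraph
    my @out;
    for my $para (split /\n{2,}/, $text) {
        if ($para =~ /^#/) {
            # headline: left untouched, but remember its anchor
            %forbidden = (slugify($para) => 1);
        } else {
            $para =~ s{($kw_regex)}{
                my $keyword = $1;
                my $link = $substitutions->{lc $keyword};
                exists $forbidden{$link} ? $keyword : "($keyword)[$link]";
            }eg;
            %forbidden = ();   # only the first paragraph after a headline is protected
        }
        push @out, $para;
    }
    return join "\n\n", @out;
}

my $doc = "# The Frog\n\nA frog is green.\n\nEvery frog can swim.";
print link_keywords($doc, { frog => '#the-frog' }), "\n";
```

In this sample the "frog" directly under "# The Frog" is left alone, while the one in the following paragraph is linked.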
    

    If headlines are not guaranteed to be separated from their paragraphs by an empty line, then paragraph mode will not work, and you'll have to roll your own.
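
A rough sketch of what "rolling your own" might look like: walk the text line by line and remember the anchor of the most recent headline. The names `link_lines` and `slugify` and the slug rule are assumptions for illustration, and `$kw_regex` is assumed to be a precompiled pattern covering every key of the table. Note one behavioural difference: this variant suppresses the self-link for the whole section up to the next headline, not just the first paragraph, which may or may not be what is wanted.

```perl
use strict;
use warnings;

# Hypothetical slug rule, e.g. "## The Frog" -> "#the-frog".
sub slugify {
    my ($headline) = @_;
    (my $slug = lc $headline) =~ s/^#+\s*//;
    $slug =~ s/\s+$//;
    $slug =~ s/[^a-z0-9]+/-/g;
    return "#$slug";
}

# Line-wise linking. $kw_regex must only match keywords that have an entry
# (lowercased) in %$substitutions.
sub link_lines {
    my ($text, $kw_regex, $substitutions) = @_;
    my $current = '';   # anchor of the most recent headline, if any
    my @out;
    for my $line (split /\n/, $text, -1) {
        if ($line =~ /^#/) {
            $current = slugify($line);   # headline: remember it, leave the text alone
        } else {
            $line =~ s{($kw_regex)}{
                my $keyword = $1;
                my $link = $substitutions->{lc $keyword};
                $link eq $current ? $keyword : "($keyword)[$link]";
            }eg;
        }
        push @out, $line;
    }
    return join "\n", @out;
}

my $doc = "# The Frog\nA frog is green.\nStill about the frog.\n# Toads\nA toad is not a frog.";
print link_lines($doc, qr/\b(?:frog)\b/i, { frog => '#the-frog' }), "\n";
```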

    Regexes are awesome, but they are not always an adequate tool.