Search code examples
regexstringtextstring-matchingpython-re

Can all string matching tasks be accomplished by only using the most basic syntax of regular expressions?


By 'basic', I mean not using these regex syntax / features:

  • non-capturing groups (?:pattern)
  • look ahead positive assert (?=pattern)
  • negative assert (?!pattern)
  • look behind (?<=pattern)
  • ... and so on

If I only use "capturing groups" along with those meta-characters (e.g. \s, \d), can I get all the job done?


Solution

  • Suppose we wish to match a string that begins 'dog' and ends 'cat' that does not contain 'pig'. We can do that with a regular expression that employs the tempered greedy token technique, which employs a negative lookahead:

    dog(?:(?!pig).)*cat
    

    Demo.

    While it is a logical impossibility to prove that this task cannot be done with a regular expression that does not contain a lookaround, I am not aware of such an expression. Readers are invited to suggest one that does.


    This regular expression can be broken down as follows.

    dog      # match 'dog'
    (?:      # begin a non-capture group
      (?!    # begin a negative lookahead
        pig  # match 'pig'
      )      # end the negative lookahead
      .      # match any character other than a line terminator
    )        # end the non-capture group
    *        # execute the preceding non-capture group zero or more times
    cat      # match 'cat'
    

    Come to think of it, one could use an alternation with all permutations of all numbers (up to the the string size less 6) of all permissible characters that forms a string that does not contain 'pig': dogcat|dogacat|dogbcat| ... |dogaabcat|dogabacat| ... |dogaZrnpiht0xcat|... but with longer strings this could be somewhat demanding computationally.