
Regular expression negative lookahead


In my home directory I have a folder drupal-6.14 that contains the Drupal platform.

From this directory I use the following command:

find drupal-6.14 -type f -iname '*' | grep -P 'drupal-6.14/(?!sites(?!/all|/default)).*' | xargs tar -czf drupal-6.14.tar.gz

This command creates a gzipped tar archive of the drupal-6.14 folder, excluding every subfolder of drupal-6.14/sites/ except sites/all and sites/default, which are included.

My question is on the regular expression:

grep -P 'drupal-6.14/(?!sites(?!/all|/default)).*'

The expression works to exclude all the folders I want excluded, but I don't quite understand why.

A common task with regular expressions is to match all strings except those that contain subpattern x; in other words, to negate a subpattern.

I think I understand that the general strategy for solving these problems is to use negative lookaheads, but I've never understood to a satisfactory level how positive and negative lookaheads and lookbehinds work.
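For reference, a minimal sketch of the usual "negate a subpattern" idiom in Python's re module. The subpattern foo and the helper name lacks_foo are placeholders chosen for this illustration, not part of the original command.

```python
import re

# Anchor at the start of the line, then assert that "foo" does not
# occur anywhere ahead before consuming the rest of the line.
# "foo" is a placeholder subpattern (an assumption for this example).
without_foo = re.compile(r'^(?!.*foo).*$')

def lacks_foo(line):
    """Return True if the line does not contain the placeholder 'foo'."""
    return without_foo.match(line) is not None

print(lacks_foo('bar baz'))   # True
print(lacks_foo('a foo b'))   # False
```

The lookahead consumes no characters, so the trailing .* still matches the whole line when the condition holds.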

Over the years, I've read many websites about them: the PHP and Python regex manuals, other pages such as http://www.regular-expressions.info/lookaround.html, and so forth, but I've never really had a solid understanding of them.

Could someone explain how this works, and perhaps provide some similar examples?

-- Update One:

Regarding Andomar's response: can a double negative lookahead be expressed more succinctly as a single positive lookahead?

i.e., is:

'drupal-6.14/(?!sites(?!/all|/default)).*'

equivalent to:

'drupal-6.14/(?=sites(?:/all|/default)).*'

???

-- Update Two:

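As per @andomar and @alan moore: a double negative lookahead cannot be swapped for a positive lookahead. The positive version requires sites/all or sites/default to follow drupal-6.14/, so it would drop every file outside sites/, which the original pattern keeps.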
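The difference can be checked with a short Python sketch. The sample paths and the helper name kept are made up for illustration; re.search is used because grep matches anywhere in the line.

```python
import re

# Pattern from the question (the "." in 6.14 is left unescaped, as in the original).
double_negative = re.compile(r'drupal-6.14/(?!sites(?!/all|/default)).*')
# Single positive lookahead proposed in Update One.
positive = re.compile(r'drupal-6.14/(?=sites(?:/all|/default)).*')

# Hypothetical file paths, for illustration only.
paths = [
    'drupal-6.14/index.php',
    'drupal-6.14/sites/all/modules/foo.module',
    'drupal-6.14/sites/default/settings.php',
    'drupal-6.14/sites/example.com/settings.php',
]

def kept(pattern, candidates):
    """Return the candidates that the pattern matches somewhere in the string."""
    return [p for p in candidates if pattern.search(p)]

print(kept(double_negative, paths))  # index.php, sites/all/..., sites/default/...
print(kept(positive, paths))         # only sites/all/... and sites/default/...
```

Only the double negative keeps drupal-6.14/index.php, so the two patterns select different sets of files.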


Solution

  • A negative lookahead says: at this position, the following regex must not match.

    Let's take a simplified example:

    a(?!b(?!c))
    
    a      Match: (?!b(?!c)) succeeds, since no b follows
    ac     Match: (?!b(?!c)) succeeds, since no b follows
    ab     No match: (?!b(?!c)) fails
    abe    No match: (?!b(?!c)) fails
    abc    Match: (?!b(?!c)) succeeds
    

    The last example is a double negation: a b is allowed after the a only if that b is followed by c. Inside the outer negative lookahead, the nested negative lookahead acts like a positive condition: when b is present, c must follow it.

    In each example, only the a is matched. The lookahead is only a condition, and does not add to the matched text.
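The table above can be reproduced with a small Python sketch. The helper name matched_text is introduced here for illustration only.

```python
import re

# The simplified pattern from the example above.
pattern = re.compile(r'a(?!b(?!c))')

def matched_text(text):
    """Return the text consumed by the first match, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None

for sample in ['a', 'ac', 'ab', 'abe', 'abc']:
    print(sample, '->', matched_text(sample))
# a -> a
# ac -> a
# ab -> None
# abe -> None
# abc -> a
```

Even for abc, the matched text is just a: the lookaheads test what follows without consuming it.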