Tags: git, posix, git-log

Does --pickaxe-regex really enable POSIX extended?


I'm disappointed by the --pickaxe-regex behavior in Git 2.43. The diffcore documentation claims the following (emphasis mine):

"-S<block of text>" detects filepairs whose preimage and postimage have different number of occurrences of the specified block of text. By definition, it will not detect in-file moves. Also, when a changeset moves a file wholesale without affecting the interesting string, diffcore-rename kicks in as usual, and -S omits the filepair (since the number of occurrences of that string didn’t change in that rename-detected filepair). When used with --pickaxe-regex, treat the <block of text> as an extended POSIX regular expression to match, instead of a literal string.

But that doesn't seem to be accurate. I've been comparing the output of these test commands in a Python repository:

# Success: just text in the search block
git log --pickaxe-regex -S'def'

# Broken on macOS: POSIX extended word boundaries
git log --pickaxe-regex -S'\bdef\b'
git log --pickaxe-regex -S'\<def\>'
# But both succeed on Git for Windows

The \b-wrapping doesn't require def to begin and end with word boundaries. It just breaks the search so that no results are returned. Just in case there is some confusion about escaping, I have also tried -S'\\bdef\\b', -S"\bdef\b", and -S"\\bdef\\b". None of them return results, but -S'def' does.

What's going on here?


Solution

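  • \b is not defined by POSIX. It's an extension that is present in Perl, Ruby, and PCRE. The same goes for \< and \>, which are GNU extensions. Whether any of them work therefore depends on the regex library your Git build uses, which is why the results differ between platforms.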

    From regex(7) on Linux:

    Regular expressions ("RE"s), as defined in POSIX.2, come in two forms: modern REs (roughly those of egrep; POSIX.2 calls these "extended" REs) and obsolete REs (roughly those of ed(1); POSIX.2 "basic" REs).

    POSIX extended REs support |, +, *, ?, ^, $ and bounds (with braces). They also support brackets with character classes.

    Again, from regex(7):

    Obsolete ("basic") regular expressions differ in several respects. '|', '+', and '?' are ordinary characters and there is no equivalent for their functionality. The delimiters for bounds are "\{" and "\}", with '{' and '}' by themselves ordinary characters. The parentheses for nested subexpressions are "\(" and "\)", with '(' and ')' by themselves ordinary characters. '^' is an ordinary character except at the beginning of the RE or(!) the beginning of a parenthesized subexpression, '$' is an ordinary character except at the end of the RE or(!) the end of a parenthesized subexpression, and '*' is an ordinary character if it appears at the beginning of the RE or the beginning of a parenthesized subexpression (after a possible leading '^').
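    The difference between the two forms is easy to see with grep, which uses basic REs by default and extended REs with -E. This is only a small sketch: the three sample lines are made up for illustration, and it assumes a POSIX-style grep is on the PATH.

    ```shell
    # Hypothetical sample input: three lines to match against.
    sample=$(printf 'aa\na{2}\na+\n')

    # Basic RE: bounds need backslashes, so 'a\{2\}' means "two a's in a row".
    bre_bound=$(printf '%s\n' "$sample" | grep 'a\{2\}')

    # Basic RE: bare braces are ordinary characters, so 'a{2}' is literal text.
    bre_literal=$(printf '%s\n' "$sample" | grep 'a{2}')

    # Extended RE: bare braces are the bound delimiters.
    ere_bound=$(printf '%s\n' "$sample" | grep -E 'a{2}')

    # Basic RE: '+' is an ordinary character, so 'a+' matches only the text "a+".
    bre_plus=$(printf '%s\n' "$sample" | grep 'a+')

    printf '%s\n' "BRE bound:   $bre_bound"
    printf '%s\n' "BRE literal: $bre_literal"
    printf '%s\n' "ERE bound:   $ere_bound"
    printf '%s\n' "BRE plus:    $bre_plus"
    ```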

    All other escapes and functionality are basically defined by Perl or PCRE (including normal C escapes, like \t and \n). I don't believe there's an option to use PCRE in the pickaxe functionality, so you'll need to either send a patch or stick to extended POSIX regexes.
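
If you want to stay within extended POSIX regexes, one possible workaround is to emulate the word boundary with bracket expressions. The sketch below is an approximation, not an exact replacement for \b: the match includes the neighbouring character, so two occurrences separated by a single character may be counted as one. The sample lines are invented for the sanity check, and the guard around git log is only there so the snippet does nothing outside a repository.

```shell
# Emulate \b with portable POSIX ERE syntax: "def" must be preceded by the
# start of a line or a non-word character, and followed by a non-word
# character or the end of a line.
pattern='(^|[^[:alnum:]_])def([^[:alnum:]_]|$)'

# Sanity-check the pattern with grep -E on made-up lines before using it in Git.
matches=$(printf '%s\n' 'def foo():' 'undefined = 1' 'x = default' 'return def' |
  grep -E "$pattern")
printf '%s\n' "$matches"

# Run the pickaxe search only inside a repository that has at least one commit.
if git rev-parse --verify --quiet HEAD >/dev/null 2>&1; then
  git log --oneline --pickaxe-regex -S"$pattern"
fi
```

Only the lines where def stands alone ("def foo():" and "return def") pass the grep check; "undefined" and "default" do not. On BSD-derived regex libraries, re_format(7) documents [[:<:]] and [[:>:]] as word-boundary assertions, but those are not POSIX either, so they are no more portable than \b.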