Search code examples
cgrepposixgnu

The type of algorithm behind different implementations of the regex.h header on POSIX systems


I recently read a fantastic article about regular expressions (see here) which explained the simplicity and elegance of the Thompson NFA for matching regular expressions, as well as the sheer complexity and slowness of other regular expressions. The diagram given shows Ruby, Python taking far longer than the Thompson NFA, in fact, the Thompson NFA had matching times similar to that of notoriously fast pattern matching UNIX programs, for example, grep and awk.

My question is, for the <regex.h> header (required by the POSIX standard) what algorithm does musl, glibc and uclibc use?

I looked at the source for the programs I compared the Thompson NFA to (GNU/Busybox grep, GNU/Busybox awk) and none of the implementations used the regex.h header, opting for their own in-house regex matching header.

NOTE: I do NOT mean regular expressions that conform to the POSIX standard - I am speaking specifically about the header supplied by POSIX-compliant (or UNIX-like) systems called regex.h, which supplies functions and structs used in compiling and matching regular expressions.


Solution

  • The GNU C library implementation – which is what you get when you use <regex.h> in Linux (and any other OS when compiling against the GNU C library) – is in glibc git: posix/regex.h, posix/regex.c, posix/regex_internal.h, posix/regex_internal.c, posix/regcomp.c, and posix/regexec.c. The internal files refer to DFA (deterministic finite automaton), and you can look at the above implementations to see that it indeed implements a DFA. The git history only reaches back to 2002 or so, but the base implementation is older than that; and I'm too lazy to go read old glibc-alpha mailing list archives to uncover the full development history details. There seem to have been some improvements (when syncing with gnulib regex implementations) in 2018.

    Suffice it to say, that for execution, even Thompson NFA is converted to a DFA, and that is what glibc uses to implement <regex.h> functionality.

    uClibc's <regex.h> implementation originates from glibc, roughly from the 2002-2006 era, based on the git log (and the copyright assignments in the source files). So DFA there, too.

    uClibc-ng is a fork of uClibc (forked around 2014-2015), and has basically the same <regex.h> implementation as uClibc.

    dietlibc uses its own custom (backtracking) implementation, libregex/rx.c, to provide <regex.h> functionality.

    musl C library has its own <regex.h> implementation also, based on Ville Laurikari's TRE (the approximate regex) library. I haven't gone through the code to classify it properly, but since its time complexity is linear wrt. the input string (but up to quadratic wrt. the regular expression), I do believe it too is based on finite automata.

    The author of Musl has compared various features of different C libraries, including simple regexes, here. Among other things, it shows that GNU C library implementation has the best runtime for regcomp()+regexec() among the C libraries tested, and all but dietlibc have roughly similar time behaviour.


    Why, then, would one implement their own regex functionality, and not rely on the one provided by the standard C library?

    The reasons I can see are portability to systems where the C library does not have regex functionality (consider e.g. GNU grep, sed, awk on Windows), and to ensure correctness and efficiency of the regex functionality.

    Personally, when I write POSIX C code (systems programming – daemons or utilities or libraries for use in Linux or *BSDs), I use <regex.h> as provided by the C library. If I were to write a new grep/sed/awk utility program, I would incorporate the regex library to the sources, so that the utility would be portable across OSes, including those that do not provide <regex.h> as part of their C library. I believe that is the reason you see some dedicated utilities incorporating their own regular expression libraries, too.