
regexec in C does not match when \b is used in the expression


I am trying to use regular expressions in my C code to find a string in each line of a text file that I am reading, and the \b word boundary does not seem to work. The string must not be part of a larger string.

After that failed, I also tried the following hand-written boundary expression and could not make it work in my code either (source here):

(?i)(?<=^|[^a-z])MYWORDHERE(?=$|[^a-z])

But when I try something simple like a as the regular expression, it finds what is expected.

Here is my shortened snippet:

#include <regex.h>
#include <stdio.h>
#include <string.h>

void readFromFile(char arr[], char * wordToSearch) {
  regex_t regex;
  int regexi;

  /* Build "\b(word)\b"; snprintf keeps a long word from overflowing regexStr. */
  char regexStr [100];
  snprintf(regexStr, sizeof(regexStr), "\\b(%s)\\b", wordToSearch);

  regexi = regcomp(&regex, regexStr, 0);
  printf("regexi while compiling: %d\n", regexi);
  if (regexi) {
    fprintf(stderr, "compile error\n");
    return;  /* regex is not usable after a failed regcomp */
  }

  FILE* file = fopen(arr, "r");
  if (file == NULL) {
    perror("fopen");
    regfree(&regex);
    return;
  }
  char line[256];

  while (fgets(line, sizeof(line), file)) {
    regexi = regexec(&regex, line, 0, NULL, 0);
    printf("%s\n", line);
    printf("regexi while execing: %d\n", regexi);
    if (!regexi) {
      printf("there is a match.\n");
    }
  }
  fclose(file);
  regfree(&regex);  /* release the compiled pattern */
}

I also tried passing REG_EXTENDED as the flag to regcomp, and that did not work either.


Solution

  • The regular expressions supported by POSIX are documented in the regex(7) manual page on Linux and in re_format(7) on Mac OS X.

    Unfortunately, the POSIX standard regular expressions (which come in two flavours: the obsolete basic syntax and the extended syntax selected with REG_EXTENDED) support neither \b nor any of the (?...) constructs, both of which I believe originated in Perl.

    Mac OS X (and possibly other BSD derived systems) additionally has the REG_ENHANCED format, which is not portable.

    Your best choice would be to use some other regular expression library such as PCRE. While word boundaries themselves can be expressed in a regular language, the grouping they require makes this harder, as POSIX doesn't even support non-capturing groups, so every boundary group becomes a capture. Without PCRE, you could use something like (^|[^[:alpha:]])(.*)($|[^[:alpha:]]), but it would surely get really messy.
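
    For what it's worth, the hand-rolled boundary idea above can be sketched with plain POSIX extended regular expressions. This is only a minimal sketch under a few assumptions: contains_word is a hypothetical helper name (not part of any library), the search word is assumed to contain no regex metacharacters, and "word character" is taken to mean [[:alpha:]] as in the expression above, so a digit or underscore next to the word still counts as a boundary.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `word` occurs in `line` as a whole word,
 * 0 if it does not, and -1 if the pattern could not be built or compiled.
 * Assumes `word` contains no regex metacharacters. */
int contains_word(const char *line, const char *word) {
  char pattern[256];
  regex_t regex;
  int n, rc;

  /* Emulate \b with explicit alternatives: start of string or a non-letter
   * before the word, end of string or a non-letter after it. */
  n = snprintf(pattern, sizeof(pattern),
               "(^|[^[:alpha:]])%s($|[^[:alpha:]])", word);
  if (n < 0 || (size_t)n >= sizeof(pattern)) {
    return -1;  /* word too long for the pattern buffer */
  }

  /* REG_EXTENDED makes ( ) and | operators rather than literals;
   * REG_NOSUB because only a yes/no answer is needed here. */
  if (regcomp(&regex, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
    return -1;
  }
  rc = regexec(&regex, line, 0, NULL, 0);
  regfree(&regex);
  return rc == 0 ? 1 : 0;
}
```

    Calling contains_word(line, wordToSearch) inside the fgets loop would replace the regexec call in the question. The trailing newline that fgets leaves in the buffer is itself a non-letter, so a word at the end of a line is still found. If you need the position of the word rather than a yes/no answer, you would have to wrap the word in its own group and read the second capture, which is where the lack of non-capturing groups starts to hurt. Compiling the pattern on every call keeps the sketch short; in a real loop you would compile once and reuse it.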