
How to escape a hyphen in a regex character group in C


I have a C program that compiles a regex, which looks something like:

   regex_t re;
   const char *pattern = "^[a-z0-9\\-#_]+$";

   if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
      printf("Error compiling %s\n", pattern);
   }

Basically, I want to match any sequence of lowercase letters, digits, dashes, hashes, or underscores. It seems, though, that the above fails because of the dash:

Error compiling ^[a-z0-9\-#_]+$

According to all the documentation I can find, in POSIX extended regexes you should be able to escape the - in a character group, but for some reason this does not work in my tests. I also tried double escaping ("^[a-z0-9\\\\-#_]+$"), which yields the same result. I understand that I can put the dash at the end of the character group without escaping it at all, but I'm wondering how to properly escape it if it's in the middle of the character group.


Solution

  • According to all the documentation I can find, in POSIX extended regexes you should be able to escape the - in a character group...

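    This is not correct. A backslash does not escape anything inside a bracket expression, so in your pattern `\-#` is read as a range from '\' to '#', which implementations typically reject because '\' collates after '#'. From POSIX section 9.3.5, RE Bracket Expression: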

    The special characters '.', '*', '[', and '\' (<period>, <asterisk>, <left-square-bracket>, and <backslash>, respectively) shall lose their special meaning within a bracket expression.
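    To see what the regex library itself reports for the original pattern, a minimal sketch along these lines may help. The helper name `try_compile` is hypothetical (it is not part of any library); it simply wraps `regcomp` and `regerror`. The exact wording of the message varies by C library.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: compile `pattern` as a POSIX ERE. On failure, copy
 * the library's regerror() message into `msg` and return the error code.
 * On success, set `msg` to the empty string and return 0. */
int try_compile(const char *pattern, char *msg, size_t msg_size)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);

    if (rc != 0) {
        /* regerror() turns the numeric code into readable text. */
        regerror(rc, &re, msg, msg_size);
        return rc;
    }

    regfree(&re);
    if (msg_size > 0)
        msg[0] = '\0';
    return 0;
}
```

    Calling `try_compile("^[a-z0-9\\-#_]+$", msg, sizeof msg)` and printing `msg` on a non-zero return should show a range-related error rather than anything about escaping, which is consistent with the backslash being treated as an ordinary character.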


    I understand that I can put the dash at the end of the character group without escaping at all, but I'm wondering how to properly escape if it's in the middle of the character group.

    There isn't a way to escape it. You have to work with the parsing rules instead, as explained in the same document.

    The <hyphen-minus> character shall be treated as itself if it occurs first (after an initial '^', if any) or last in the list, or as an ending range point in a range expression. As examples, the expressions "[-ac]" and "[ac-]" are equivalent and match any of the characters 'a', 'c', or '-'; "[^-ac]" and "[^ac-]" are equivalent and match any characters except 'a', 'c', or '-'; the expression "[%--]" matches any of the characters between '%' and '-' inclusive; the expression "[--@]" matches any of the characters between '-' and '@' inclusive; and the expression "[a--@]" is either invalid or equivalent to '@', because the letter 'a' follows the symbol '-' in the POSIX locale. To use a <hyphen-minus> as the starting range point, it shall either come first in the bracket expression or be specified as a collating symbol; for example, "[][.-.]-0]", which matches either a <right-square-bracket> or any character or collating element that collates between <hyphen-minus> and 0, inclusive.

    If a bracket expression specifies both '-' and ']', the ']' shall be placed first (after the '^', if any) and the '-' last within the bracket expression.

    What a nightmare. It's simplest to just put the dash at the front or back.
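    As a rough illustration of the "front or back" advice, here is a small sketch. The helper name `ere_matches` is hypothetical, and treating a pattern that fails to compile as "no match" is a simplification made for brevity.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if `text` matches the POSIX ERE
 * `pattern`. A pattern that fails to compile counts as "no match". */
bool ere_matches(const char *pattern, const char *text)
{
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;

    /* REG_NOSUB was given, so no match offsets are requested. */
    bool matched = (regexec(&re, text, 0, NULL, 0) == 0);

    regfree(&re);
    return matched;
}
```

    With this helper, both "^[a-z0-9#_-]+$" (dash last) and "^[-a-z0-9#_]+$" (dash first) should accept a string such as "foo-bar_1#2". If the group also needs a literal ']', the rule quoted above applies: ']' goes first and '-' goes last, as in "^[]a-z-]+$".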

    POSIX regexes are pretty crude. Consider PCRE or GLib's GRegex instead for anything serious.