Tags: c++, regex, glib

GLib regex match gives segmentation fault on specific matches and patterns


In my program, I scan files for certain text. After weeks of debugging, I found that certain lines of text cause a segmentation fault, depending on the regex pattern used. For example, the following line of text causes a segmentation fault:

#include <glib.h>

int main()
{
    GRegex* regex = g_regex_new("\\bhtml\\b", G_REGEX_CASELESS, G_REGEX_MATCH_NOTEMPTY, NULL);
    //The following line causes a segmentation fault
    g_regex_match(regex, "<code>USR1</code>) f\374hren. Ersteres ist ein schwerer Fehler,", G_REGEX_MATCH_NOTEMPTY, NULL);
    return 0;
}

The following, which uses a different pattern, does not cause a segmentation fault:

#include <glib.h>

int main()
{
    GRegex* regex = g_regex_new("html", G_REGEX_CASELESS, G_REGEX_MATCH_NOTEMPTY, NULL);
    g_regex_match(regex, "<code>USR1</code>) f\374hren. Ersteres ist ein schwerer Fehler,", G_REGEX_MATCH_NOTEMPTY, NULL);
    return 0;
}

It is the combination of the regex pattern and the \374 in the string that causes the segmentation fault. However, if I manually escape \374 as \\374 in the string literal, no segmentation fault occurs.

The source of that line of text comes from the file: https://httpd.apache.org/docs/2.4/de/stopping.html

In this specific case, when I read the file and store the line in a string, the character ü ends up stored as the single byte \374 (octal).

How can I fix this so that, when reading thousands of files with hundreds of lines each, any of which could contain anything, I can use any regex pattern without running into segmentation faults like this one?


Solution

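  • g_strescape helps in such cases. It escapes '\' and other special characters, and it replaces every non-ASCII byte (such as \374) with a backslash followed by the byte's octal digits. The result is plain ASCII, and therefore valid UTF-8, which GRegex expects from its input unless the pattern is compiled with G_REGEX_RAW. This prevents g_regex_match from segfaulting.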