In my program, I scan files for certain text. After weeks of debugging, I found that certain lines of text cause a segmentation fault, depending on the regex pattern used. For example, the line of text in the following program causes a segmentation fault:
#include <glib.h>

int main()
{
    GRegex* regex = g_regex_new("\\bhtml\\b", G_REGEX_CASELESS, G_REGEX_MATCH_NOTEMPTY, NULL);
    // The following line causes a segmentation fault
    g_regex_match(regex, "<code>USR1</code>) f\374hren. Ersteres ist ein schwerer Fehler,", G_REGEX_MATCH_NOTEMPTY, NULL);
    return 0;
}
The following program, which uses a different pattern, does not cause a segmentation fault:
#include <glib.h>

int main()
{
    GRegex* regex = g_regex_new("html", G_REGEX_CASELESS, G_REGEX_MATCH_NOTEMPTY, NULL);
    g_regex_match(regex, "<code>USR1</code>) f\374hren. Ersteres ist ein schwerer Fehler,", G_REGEX_MATCH_NOTEMPTY, NULL);
    return 0;
}
It is the combination of the regex pattern and the \374 in the string that causes the segmentation fault. However, I noticed that if I manually escape \374 as \\374, no segmentation fault happens.
That line of text comes from the file: https://httpd.apache.org/docs/2.4/de/stopping.html
In this specific case, when I read the file and store the line of text in a string, the character is stored as the byte \374 instead of as ü.
How can I fix this so that, when reading thousands of files with hundreds of lines each that could contain anything, I can use any regex pattern without hitting segmentation faults like this one?
g_strescape helps in such cases. It escapes '\' and other special characters, and it also replaces non-ASCII bytes such as \374 with a backslash followed by their octal value, which prevents g_regex_match from segfaulting.