c++ regex pcre

Unable to match whole string using PCRE regex in C++


The regular expression '(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+ matches Ġmeousrtr as a single token, as expected. This can be seen at https://regex101.com/r/UR0P6T/1

But when I use the PCRE2 library in C++, I get 3 separate matches instead of 1. I understand that the Unicode character Ġ is 2 bytes wide in UTF-8 and that the expression is matching each of those bytes on its own, but shouldn't it match the whole string, as it does at https://regex101.com/r/UR0P6T/1?

# Output of regex expression
'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+

# Matches
Match Succeeded at 0
�x
Match Succeeded at 1
�x
Match Succeeded at 2
meousrtrx

Below is the C++ code:

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <string.h>
#include <iostream>
using namespace std;

int main(int argc, char **argv)
{
    PCRE2_SPTR expression = (PCRE2_SPTR) "'(?:[sdmt]|ll|ve|re)| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+";
    PCRE2_SPTR text = (PCRE2_SPTR) "Ġmeousrtr";
    PCRE2_SIZE eoffset;
    PCRE2_SIZE *ovector;
    pcre2_code *re;
    pcre2_match_data *match_data;
    char *c = (char *)expression;
    while (*c)
        printf("%c", (unsigned int)*c++);
    printf("\n");

    int error_number;
    int result;
    size_t start_offset = 0;
    size_t text_len;
    u_int32_t options = 0;

    text_len = strlen((char *)text);

    re = pcre2_compile(expression, PCRE2_ZERO_TERMINATED, 0, &error_number, &eoffset, NULL);
    if (re == NULL)
    {
        PCRE2_UCHAR buffer[256];
        pcre2_get_error_message(error_number, buffer, sizeof(buffer));
        cout << buffer;
        return 1;
    }
    match_data = pcre2_match_data_create_from_pattern(re, NULL);
    while (true)
    {
        result = pcre2_match(re, text, text_len, start_offset, options, match_data, NULL);
        if (result < 0)
        {
            switch (result)
            {
            case PCRE2_ERROR_NOMATCH:
                cout << "No matches found!";
                return 0;

            default:
                cout << "Matching Error" << result;
                return -1;
            }
            pcre2_match_data_free(match_data);
            pcre2_code_free(re);
        }
        ovector = pcre2_get_ovector_pointer(match_data);
        printf("Match Succeeded at %d\n", ovector[0]);
        int i;
        for (i = 0; i < result; i++)
        {
            PCRE2_SPTR substring_start = text + ovector[2 * i];
            PCRE2_SIZE substring_length = ovector[2 * i + 1] - ovector[2 * i];
            printf("%.*s\n", (int)substring_length, (char *)substring_start);
        }
        start_offset = ovector[1];
    }
}

Solution

  • You need to pass PCRE2_UTF in the pcre2_compile() options so that PCRE2 treats the pattern and subject as UTF-8 encoded text (which is presumably how your source file is encoded):

    re = pcre2_compile(expression, PCRE2_ZERO_TERMINATED, PCRE2_UTF,
                       &error_number, &eoffset, nullptr);
    

    Your code has other issues. For example, the printf() format for size_t values is %zu, not %d (it would be better to use C++ iostream output consistently instead of mixing iostreams and stdio). But telling PCRE2 that its input is UTF-8 is the one that matters here. With that change, your program outputs:

    '(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
    Match Succeeded at 0
    Ġmeousrtr
    No matches found!