c++ regex qt qregexp

Using QRegExp to parse headers


I am parsing an email header using QRegExp. My problem is that if a header field spans multiple lines, my regex doesn't work.

Here is my regex (I have \r\n as placeholders for now):

QRegExp regex("([\\w-]+): (.+)\\r\\n(?:([^:]+)\\r\\n)?");
regex.setMinimal(true);
// PCRE: ([\w-]+): (.+?)\\r\\n(?:([^:]+?)\\r\\n)?

And what I'm trying to parse:

MIME-Version: 1.0\r\n
x-no-auto-attachment: 1\r\n
Received: by 10.200.36.132; Sun, 5 Feb 2017 01:21:33 -0800 (PST)\r\n
Date: Sun, 5 Feb 2017 01:21:33 -0800\r\n
Message-ID: <IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII@mail.gmail.com>\r\n
Subject: =?UTF-8?Q?MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM?=\r\n
=?UTF-8?Q?ail?=\r\n
From: =?UTF-8?B?VGhlIGZ1Y2sgYXJlIHUgbG9va2luZyBmb3I/?= <noreply@mail.com>\r\n
To: mail mail <mail@mail.com>\r\n
Content-Type: multipart/alternative; boundary=1a3xca651sv561fd321c5xv61sd12\r\n

It works as expected in PHP, JS, etc. (https://regex101.com/r/0J2jXT/2), but not with QRegExp: I cannot capture the second line of the Subject field.

EDIT: What's weird is that if I use std::regex from C++11, I get the right result! http://coliru.stacked-crooked.com/a/93494669f24422e1


Solution

  • QRegExp is an old class and should not be used anymore (unless you are forced to work with Qt 4). If you can use Qt 5 and want better performance, use QRegularExpression. With it, your code works:

    QString data = "Message-ID: <IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII@mail.gmail.com>\r\n"
                   "Subject: =?UTF-8?Q?MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM?=\r\n"
                   "=?UTF-8?Q?ail?=\r\n"
                   "From: =?UTF-8?B?VGhlIGZ1Y2sgYXJlIHUgbG9va2luZyBmb3I/?= <noreply@mail.com>\r\n";
    
    QRegularExpression rx("([\\w-]+): (.+)\\r\\n(?:([^:]+)\\r\\n)?");
    QRegularExpressionMatchIterator it = rx.globalMatch(data);
    while(it.hasNext()) {
        QRegularExpressionMatch match = it.next();
        qDebug() << match.capturedTexts();
    }
    

    Output:

    ("Message-ID: <IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII@mail.gmail.com>\r\n", "Message-ID", "<IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII@mail.gmail.com>")
    ("Subject: =?UTF-8?Q?MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM?=\r\n=?UTF-8?Q?ail?=\r\n", "Subject", "=?UTF-8?Q?MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM?=", "=?UTF-8?Q?ail?=")
    ("From: =?UTF-8?B?VGhlIGZ1Y2sgYXJlIHUgbG9va2luZyBmb3I/?= <noreply@mail.com>\r\n", "From", "=?UTF-8?B?VGhlIGZ1Y2sgYXJlIHUgbG9va2luZyBmb3I/?= <noreply@mail.com>")
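
For comparison, the question's EDIT mentions that std::regex from C++11 also gives the right result. The linked code is not reproduced here, so the following is only a minimal sketch of how that could look, assuming the same pattern and the default ECMAScript grammar. The `HeaderField` struct and the `parseHeaders` function are hypothetical names introduced for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// One parsed header field. `continuation` stays empty when the field
// fits on a single line. (Hypothetical type, not part of Qt or the STL.)
struct HeaderField {
    std::string name;
    std::string value;
    std::string continuation;
};

// Hypothetical helper: applies the pattern from the question with
// std::regex and collects every match in order of appearance.
std::vector<HeaderField> parseHeaders(const std::string& data) {
    // Same pattern as in the question. In the ECMAScript grammar '.' does
    // not match '\r' or '\n', so (.+) stops at the end of the line.
    static const std::regex rx("([\\w-]+): (.+)\\r\\n(?:([^:]+)\\r\\n)?");

    std::vector<HeaderField> fields;
    for (std::sregex_iterator it(data.begin(), data.end(), rx), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        // m[3] is an unmatched sub-expression for single-line fields;
        // str() then returns an empty string.
        fields.push_back({m[1].str(), m[2].str(), m[3].str()});
    }
    return fields;
}
```

Calling `parseHeaders` on the `data` string above should yield three fields, with the second line of Subject in `continuation`. Note that this pattern captures at most one continuation line, and only if that line contains no colon.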