
Splitting strings separated by \r\n into array of strings [C/C++]


I have a string containing, e.g., "FirstWord\r\nSecondWord\r\nThird Word\r\n", and so on. I want to split it into an array of strings using vector<string>, so that I get:

FileName[0] == "FirstWord";
FileName[1] == "SecondWord"; 
FileName[2] == "Third Word";

Also, note the space in the third string.

This is what I've got so far:

string text = Files; // Files var contains the huge string of lines separated by \r\n
vector<string> FileName; // (optionally) Here I want to store the result without \r\n

regex rx("[^\\s]+\r\n");
sregex_iterator FormatedFileList(text.begin(), text.end(), rx), rxend;

while(FormatedFileList != rxend)
{
    FileName.push_back(FormatedFileList->str().c_str());
    ++FormatedFileList;
}

It works, but when it comes to the third string which is "Third Word\r\n", it only gives me "Word\r\n".

Can anyone explain to me how regular expressions work? I'm a bit confused.


Solution

  • \s matches any whitespace character, including the regular space, tab, \r, \n and a few others, so [^\s]+ stops at the space inside "Third Word". You only want to exclude \r and \n, so your regex should be

    regex rx("[^\r\n]+\r\n");
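
    As a minimal sketch of that fix dropped into a loop like the one in the question, the corrected pattern can be tried out as follows. The helper name match_lines is hypothetical and only there to make the snippet self-contained. Note that every stored match still carries its trailing \r\n.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every match of the corrected pattern.
// Each stored match still ends with its "\r\n" terminator.
std::vector<std::string> match_lines(const std::string& text)
{
    // In an ordinary C++ string literal, "\r" and "\n" are turned into the
    // actual CR and LF characters before std::regex sees the pattern.
    const std::regex rx("[^\r\n]+\r\n");

    std::vector<std::string> lines;
    for (std::sregex_iterator it(text.begin(), text.end(), rx), end; it != end; ++it)
    {
        lines.push_back(it->str());
    }
    return lines;
}
```

    Called on "FirstWord\r\nSecondWord\r\nThird Word\r\n", this should return three entries, the last one being "Third Word\r\n" with the space intact.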
    

    EDIT: This will not fit in a comment, and it will not be exhaustive. Regexes are a fairly complex topic, but I'll do my best to give a cursory explanation. All of this makes more sense if you grok formal languages, so I encourage you to read up on them; there are also countless regex tutorials on the net that go into more detail and are worth reading. Okay.

    Your code uses sregex_iterator to walk through all places in the string text where the regular expression rx matches, then turns them into strings and saves them. So, what are regular expressions?

    Regular expressions are a way of applying pattern matching to strings. This can range from simple substring searches to...well, to complex substring searches, really. Instead of just looking for an instance of "oba" in the string "foobar", for example, you might search for "oo" followed by any character followed by "a" and find it in "foobar" as well as in "foonarf".
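
    To make the "foobar" / "foonarf" example concrete, here is a small sketch using std::regex_search, which looks for the pattern anywhere inside the string. It assumes std::regex's default ECMAScript grammar, where "any character" is written as a period; the helper name contains_oo_any_a is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if "oo", then any one character, then "a"
// occurs somewhere inside s.
bool contains_oo_any_a(const std::string& s)
{
    const std::regex rx("oo.a");
    return std::regex_search(s, rx);
}
```

    Both contains_oo_any_a("foobar") and contains_oo_any_a("foonarf") should come back true, while a string such as "fooa" has no character between the "oo" and the "a" and so should not.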

    In order to enable this kind of pattern search, you must have a way to specify what pattern you are looking for, and regular expressions are one such way. The details vary across implementations, but in general it works by defining special characters that match special things or modify the behaviour of other parts of the pattern. This sounds confusing, so let's consider a few examples:

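    • The period . matches any single character (in std::regex's default ECMAScript grammar, any character except a line terminator such as \r or \n)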
    • Something followed by the Kleene star * matches zero or more instances of that something
    • Something followed by a + will match one or more instances of that something
    • Brackets [, ] enclose a set of characters; the whole thing then matches any one of those characters.
    • The caret ^ as the first character inside the brackets inverts the set, so the bracket expression matches any one character that is not listed

    Still confusing. So let's put it together:

    oo.a
    

    is a regular expression using the period. It will match "oo.a", "ooba", "oona", "oo|a" and anything else that is two o's followed by one character followed by an a. It will not match "ooa", "oba" or "nonsense".

    a*
    

    will match "", "a", "aa", "aaa", and any other sequence consisting only of a's but nothing else.

    [fgh]oobar
    

    will match any of "foobar", "goobar", and "hoobar", nothing else.

    [^fgh]oobar
    

    will match "aoobar", "boobar", "coobar" and so forth but not "foobar", "goobar" and "hoobar".

    [^fgh]+oobar
    

    will match "aoobar", "aboobar", "abcoobar", but not "oobar", "foobar", "agoobar", and "abhoobar".
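
    The toy patterns above are easy to try out. The sketch below uses std::regex_match, which succeeds only if the pattern covers the whole string; that is the sense in which "match" is used in these examples. The helper name matches_whole is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only if the pattern matches the entire string,
// not just a part of it.
bool matches_whole(const std::string& s, const std::string& pattern)
{
    return std::regex_match(s, std::regex(pattern));
}
```

    For instance, matches_whole("ooba", "oo.a") and matches_whole("abcoobar", "[^fgh]+oobar") should be true, whereas matches_whole("ooa", "oo.a") and matches_whole("agoobar", "[^fgh]+oobar") should be false. std::regex_search is more lenient because it accepts a match anywhere inside the string; a* searched in "bbb", for example, succeeds with an empty match.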

    In your case,

    [^\r\n]+\r\n
    

    will match any run of one or more characters that are neither \r nor \n, followed immediately by \r\n. You then iterate through all those matches and save the matched portions of text. Note that each matched portion includes the trailing \r\n.
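
    Since the question asks for the strings without the \r\n, one possible variation (not part of the regex given above) is to wrap the line content in parentheses, which creates a capture group, and to store only that group. This is a sketch under that assumption; split_crlf_lines is a hypothetical name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: finds every "\r\n"-terminated line in text and
// stores it without the terminator.
std::vector<std::string> split_crlf_lines(const std::string& text)
{
    // The parentheses form capture group 1, which covers only the line
    // content; the "\r\n" after it is matched but not captured.
    const std::regex rx("([^\r\n]+)\r\n");

    std::vector<std::string> lines;
    for (std::sregex_iterator it(text.begin(), text.end(), rx), end; it != end; ++it)
    {
        lines.push_back((*it)[1].str());  // group 1 excludes "\r\n"
    }
    return lines;
}
```

    With the input from the question, this should yield "FirstWord", "SecondWord" and "Third Word". Because the pattern requires at least one character followed by \r\n, empty lines and a final line that lacks the terminator are skipped.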

    That is about as deep as I believe I can reasonably go here. This rabbit hole is very deep, which means that you can do freaky cool stuff with regexes but that you should not expect to master them in a day or two. Most of it goes along the lines of what I just outlined, but in true programmer's fashion, most regex implementations go beyond the mathematical scope of regular languages and expressions and introduce useful but mindbendy stuff. Dragons be ahead, but the journey is worth it.