I have a string containing e.g. "FirstWord\r\nSecondWord\r\nThird Word\r\n"
and so on...
I want to split it into an array of strings using vector<string>
so I would get:
FileName[0] == "FirstWord";
FileName[1] == "SecondWord";
FileName[2] == "Third Word";
Also, note the space in the third string.
This is what I've got so far:
string text = Files; // Files var contains the huge string of lines separated by \r\n
vector<string> FileName; // (optionally) Here I want to store the result without \r\n
regex rx("[^\\s]+\r\n");
sregex_iterator FormatedFileList(text.begin(), text.end(), rx), rxend;
while(FormatedFileList != rxend)
{
FileName.push_back(FormatedFileList->str().c_str());
++FormatedFileList;
}
It works, but when it comes to the third string, which is "Third Word\r\n", it only gives me "Word\r\n".
Can anyone explain to me how regular expressions work? I'm a bit confused.
\s matches all whitespace, including the regular space, tab and a few others. You only want to exclude \r and \n, so your regex should be
regex rx("[^\r\n]+\r\n");
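To see the difference for yourself, here is a minimal sketch that runs a pattern over a string the same way your loop does. collect_matches is a helper name I made up for illustration; it is not part of your code or of the standard library.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every match of `pattern` in `text`, in order of appearance.
// collect_matches is a hypothetical helper, introduced only for this sketch.
std::vector<std::string> collect_matches(const std::string& text,
                                         const std::string& pattern)
{
    const std::regex rx(pattern);
    std::vector<std::string> matches;
    // A default-constructed sregex_iterator serves as the end marker.
    for (std::sregex_iterator it(text.begin(), text.end(), rx), end; it != end; ++it)
        matches.push_back(it->str());
    return matches;
}
```

Called on "FirstWord\r\nSecondWord\r\nThird Word\r\n", the pattern "[^\\s]+\r\n" should yield "Word\r\n" as its third match, while "[^\r\n]+\r\n" should yield "Third Word\r\n". Note that in both cases the matched text still ends in \r\n.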
EDIT: This will not fit in a comment, and it will not be exhaustive -- regexes are a fairly complex topic, but I'll do my best to give a cursory explanation. All of this does make more sense if you grok formal languages, so I encourage you to read up on it, and there are countless regex tutorials on the net that go into more detail and that you should also read. Okay.
Your code uses sregex_iterator to walk through all places in the string text where the regular expression rx matches, then turns them into strings and saves them. So, what are regular expressions?
Regular expressions are a way of applying pattern matching to strings. This can range from simple substring searches to...well, to complex substring searches, really. Instead of just looking for an instance of "oba" in the string "foobar", for example, you might search for "oo" followed by any character followed by "a" and find it in "foobar" as well as in "foonarf".
In order to enable this kind of pattern search, you must have a way to specify what pattern you are looking for, and regular expressions are one such way. The details vary across implementations, but in general it works by defining special characters that match special things or modify the behaviour of other parts of the pattern. This sounds confusing, so let's consider a few examples:
. matches any single character
* matches zero or more instances of whatever comes directly before it
+ matches one or more instances of whatever comes directly before it
[ and ] enclose a set of characters; the whole thing then matches any one of those characters
^ as the first character inside the brackets inverts the selection of a bracket expression
Still confusing. So let's put it together:
oo.a is a regular expression using the "." character. This will match "oo.a", "ooba", "oona", "oo|a" and anything else that is two o's followed by one character followed by an a. It will not match "ooa", "oba" or "nonsense".
a* will match "", "a", "aa", "aaa", and any other sequence consisting only of a's but nothing else.
[fgh]oobar will match any of "foobar", "goobar", and "hoobar", nothing else.
[^fgh]oobar will match "aoobar", "boobar", "coobar" and so forth but not "foobar", "goobar" and "hoobar".
[^fgh]+oobar will match "aoobar", "aboobar", "abcoobar", but not "oobar", "foobar", "agoobar", and "abhoobar".
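If you want to try patterns like these yourself, a small sketch along the following lines may help. It assumes the default ECMAScript grammar of std::regex, and full_match is just a helper name I chose for illustration.

```cpp
#include <regex>
#include <string>

// True if the whole of `s` matches `pattern`.
// std::regex_match tests the entire string, unlike std::regex_search,
// which accepts a match anywhere inside it.
// full_match is a hypothetical helper, introduced only for this sketch.
bool full_match(const std::string& s, const std::string& pattern)
{
    return std::regex_match(s, std::regex(pattern));
}
```

For instance, full_match("ooba", "oo.a") should be true and full_match("ooa", "oo.a") false; likewise full_match("abcoobar", "[^fgh]+oobar") should be true and full_match("agoobar", "[^fgh]+oobar") false.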
In your case, [^\r\n]+\r\n will match any instance of one or more characters that are neither \r nor \n, followed by \r\n. You then iterate through all those matches and save the matched portions of text.
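Since each match includes the trailing \r\n and you wanted the result without it, one possible approach (a sketch, not the only way to do it) is to wrap the part you care about in parentheses and read that capture group instead of the whole match. split_crlf_lines is a name I made up for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Split `text` into lines terminated by "\r\n", dropping the terminator.
// split_crlf_lines is a hypothetical helper, introduced only for this sketch.
std::vector<std::string> split_crlf_lines(const std::string& text)
{
    // The parentheses form capture group 1, which holds only the line content.
    const std::regex rx("([^\r\n]+)\r\n");
    std::vector<std::string> lines;
    for (std::sregex_iterator it(text.begin(), text.end(), rx), end; it != end; ++it)
        lines.push_back((*it)[1].str()); // group 1, without the "\r\n"
    return lines;
}
```

With the input from the question this should give "FirstWord", "SecondWord" and "Third Word". Be aware that a final line that is not followed by \r\n is not matched by this pattern and would be dropped.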
That is about as deep as I believe I can reasonably go here. This rabbit hole is very deep, which means that you can do freaky cool stuff with regexes but that you should not expect to master them in a day or two. Most of it goes along the lines of what I just outlined, but in true programmer's fashion, most regex implementations go beyond the mathematical scope of regular languages and expressions and introduce useful but mindbendy stuff. Dragons be ahead, but the journey is worth it.