Erroneous tokenizing

I have this code:

#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>

typedef boost::tokenizer<boost::char_separator<char> > tokenizer;

int main() {
    using namespace std;
    boost::char_separator<char> sep(",");

    string s1 = "hello, world";
    tokenizer tok1(s1, sep);
    for (auto& token : tok1) {
        cout << token << " ";
    }
    cout << endl;

    tokenizer tok2(string("hello, world"), sep);
    for (auto& token : tok2) {
        cout << token << " ";
    }
    cout << endl;

    tokenizer tok3(string("hello, world, !!"), sep);
    for (auto& token : tok3) {
        cout << token << " ";
    }
    cout << endl;

    return 0;
}

This code produces the following result:

hello  world 
hello  
hello  world  !!

Obviously, the second line is wrong. I was expecting hello world instead. What is the problem?

Solution

The tokenizer does not create a copy of the string you pass as the first argument to its constructor, nor does it compute all the tokens upon construction and cache them. It only keeps iterators into that string, and tokens are extracted lazily, on demand.

For that to work, the object from which tokens are extracted must stay alive for as long as tokens are being extracted.

Here, the string passed to the constructor of tok2 is a temporary, and it is destroyed at the end of the full expression that initializes tok2 (the same applies to tok3). The tokenizer is left holding iterators into a string that no longer exists, so you get undefined behavior as soon as it tries to use them.

Note that tok3 gives you the expected output purely by chance: the expected output is just one of the possible outcomes of a program with undefined behavior.