Tags: c++, string, for-loop, vector, tokenize

Output of string in for loop changes depending on whether a certain condition is met


I've been trying to tokenize a string in C++. I have a for loop that I'm using to iterate over the string, as seen below:

for(unsigned int i=0; i < data_str.length(); i++)
{
    tok += data_str[i];
    if(tok[i] == '\n')
    {
        //cout << "NEWLINE" << endl;
        tok = "";
    }
    if(tok == "output:")
    {
        cout << "OUTPUT FOUND" << endl;
        tokens.push_back("output:");
        tok = "";
    }
    cout << tok << endl;
}

As you can see, I reset the tok variable when a certain token ("output:") is found, after adding that string to a vector called tokens, which stores my tokens. To check that tokens held the expected number of strings, I printed it out. I expected two strings, each saying "output:", but the vector contained only one. After a little debugging, I found that whenever I reset tok after finding the token "output:", the loop finds only one occurrence of "output:". I then printed tok on every iteration and got the following output:

o
ou
out
outp
outpu
output
OUTPUT FOUND

"
"H
"He
"Hel
"Hell
"Hello
"Hello
"Hello W
"Hello Wo
"Hello Wor
"Hello Worl
"Hello World
"Hello World"
"Hello World"

"Hello World"
o
"Hello World"
ou
"Hello World"
out
"Hello World"
outp
"Hello World"
outpu
"Hello World"
output
"Hello World"
output:
"Hello World"
output:"
"Hello World"
output:"G
"Hello World"
output:"Go
"Hello World"
output:"Goo
"Hello World"
output:"Good
"Hello World"
output:"Goody
"Hello World"
output:"Goodye
"Hello World"
output:"Goodye
"Hello World"
output:"Goodye W
"Hello World"
output:"Goodye Wo
"Hello World"
output:"Goodye Wor
"Hello World"
output:"Goodye Worl
"Hello World"
output:"Goodye World
"Hello World"
output:"Goodye World"
output:string

When I commented out the line that resets the tok variable, I got:

o
ou
out
outp
outpu
output
OUTPUT FOUND
output:
output:"
output:"H
output:"He
output:"Hel
output:"Hell
output:"Hello
output:"Hello
output:"Hello W
output:"Hello Wo
output:"Hello Wor
output:"Hello Worl
output:"Hello World
output:"Hello World"

o
ou
out
outp
outpu
output
OUTPUT FOUND
output:
output:"
output:"G
output:"Go
output:"Goo
output:"Good
output:"Goody
output:"Goodye
output:"Goodye
output:"Goodye W
output:"Goodye Wo
output:"Goodye Wor
output:"Goodye Worl
output:"Goodye World
output:"Goodye World"
output:string
output:string

Why does my loop work correctly only when I'm not resetting the tok variable? I have to reset the variable, otherwise other parts of my program won't work. Is there an alternative way to reset tok?


Solution

  • It is obvious that tok is a std::string, so:

    for(unsigned int i=0; i < data_str.length(); i++)
    {
        tok += data_str[i];
        if(tok[i] == '\n')
        {
            //cout << "NEWLINE" << endl;
            tok = "";
        }
    

    Let's use paper and pencil, and follow along just this part of the parsing algorithm. Assuming that data_str consists of the following text:

    "hello\nworld"
    

    On the iteration where i is 5, data_str[5] gets appended to tok, so tok now contains "hello\n". Since tok[5] is '\n', tok gets cleared to an empty string.

    On the next iteration, data_str[6] gets appended to an empty tok, so tok now contains just a "w" (since it was cleared on the previous iteration of the loop).

     if(tok[i] == '\n')
    

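    i is now 6, so this checks tok[6]. But tok contains only one character, so index 6 is past the end of the string. This is undefined behavior, and the result of the comparison is meaningless.

    From this point forward, things go pretty much off the rails. The reset after "output:" is found has the same effect: tok shrinks back to empty while i keeps growing.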
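
    The paper-and-pencil walk-through above can also be checked mechanically. The sketch below is only an illustration: index_is_valid and first_bad_index are hypothetical helper names, not part of the question's code, and the replay covers only the newline reset, not the "output:" branch. It relies on std::string::at, which performs the bounds check that operator[] skips and throws std::out_of_range for an invalid index.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the question): returns true when s[i]
// refers to an actual character of s. std::string::at throws
// std::out_of_range when i >= s.size(), whereas operator[] does not check.
bool index_is_valid(const std::string& s, std::size_t i)
{
    try {
        (void)s.at(i);
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}

// Hypothetical helper: replays the newline-reset part of the loop on
// data_str and returns the first i for which tok[i] would be out of range,
// or data_str.size() if that never happens.
std::size_t first_bad_index(const std::string& data_str)
{
    std::string tok;
    for (std::size_t i = 0; i < data_str.size(); ++i) {
        tok += data_str[i];
        if (!index_is_valid(tok, i)) {
            return i;  // tok is shorter than i + 1 characters here
        }
        if (tok[i] == '\n') {
            tok = "";  // from now on tok.size() and i diverge
        }
    }
    return data_str.size();
}
```

    With the example text "hello\nworld", first_bad_index returns 6, which matches the iteration identified by hand above.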

    If the intent is to clear the tok buffer after every newline, check the last character of tok, which is tok[tok.size()-1] (or tok.back() since C++11), instead of tok[i]. Once tok has been reset, i and the size of tok no longer have anything to do with each other.
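
    Putting that suggestion together, one possible version of the loop looks like the sketch below. It assumes C++11 (for tok.back()) and wraps the loop in a hypothetical tokenize function; the question shows only the loop body, so the function name, signature, and return type are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical wrapper around the loop from the question. Pushes one
// "output:" entry per occurrence at the start of a token, and resets tok
// after each newline and after each match.
std::vector<std::string> tokenize(const std::string& data_str)
{
    std::vector<std::string> tokens;
    std::string tok;
    for (std::size_t i = 0; i < data_str.size(); ++i) {
        tok += data_str[i];
        // A character was just appended, so tok is non-empty and
        // back() is safe to call.
        assert(!tok.empty());
        if (tok.back() == '\n') {
            tok.clear();            // start a fresh token on the next line
        } else if (tok == "output:") {
            tokens.push_back(tok);
            tok.clear();            // safe now: nothing indexes tok with i
        }
    }
    return tokens;
}
```

    For an input with two lines that each begin with output: (an assumed input, since the question does not show data_str), this returns a vector holding two "output:" strings. Comparing data_str[i] against '\n' directly would work as well, because that is the character that was just appended.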