I've been trying to tokenize a string in C++. I have a for loop that I'm using to iterate over the string, as seen below:
for(unsigned int i=0; i < data_str.length(); i++)
{
    tok += data_str[i];
    if(tok[i] == '\n')
    {
        //cout << "NEWLINE" << endl;
        tok = "";
    }
    if(tok == "output:")
    {
        cout << "OUTPUT FOUND" << endl;
        tokens.push_back("output:");
        tok = "";
    }
    cout << tok << endl;
}
As you can see, I'm resetting the tok variable when a certain token ("output:") is found. I then add a string to a vector called tokens, which I'm using to store my tokens. To check whether tokens held the expected number of strings, I printed it out. I expected two strings, each saying "output:". However, the vector contained only one. After a little debugging, I found that whenever I reset the tok variable after finding the token "output:", the loop finds only one occurrence of "output:". I then printed the tok variable on every iteration and got the following output:
o
ou
out
outp
outpu
output
OUTPUT FOUND
"
"H
"He
"Hel
"Hell
"Hello
"Hello
"Hello W
"Hello Wo
"Hello Wor
"Hello Worl
"Hello World
"Hello World"
"Hello World"
"Hello World"
o
"Hello World"
ou
"Hello World"
out
"Hello World"
outp
"Hello World"
outpu
"Hello World"
output
"Hello World"
output:
"Hello World"
output:"
"Hello World"
output:"G
"Hello World"
output:"Go
"Hello World"
output:"Goo
"Hello World"
output:"Good
"Hello World"
output:"Goody
"Hello World"
output:"Goodye
"Hello World"
output:"Goodye
"Hello World"
output:"Goodye W
"Hello World"
output:"Goodye Wo
"Hello World"
output:"Goodye Wor
"Hello World"
output:"Goodye Worl
"Hello World"
output:"Goodye World
"Hello World"
output:"Goodye World"
output:string
When I commented out the line that reset the tok variable, I got:
o
ou
out
outp
outpu
output
OUTPUT FOUND
output:
output:"
output:"H
output:"He
output:"Hel
output:"Hell
output:"Hello
output:"Hello
output:"Hello W
output:"Hello Wo
output:"Hello Wor
output:"Hello Worl
output:"Hello World
output:"Hello World"
o
ou
out
outp
outpu
output
OUTPUT FOUND
output:
output:"
output:"G
output:"Go
output:"Goo
output:"Good
output:"Goody
output:"Goodye
output:"Goodye
output:"Goodye W
output:"Goodye Wo
output:"Goodye Wor
output:"Goodye Worl
output:"Goodye World
output:"Goodye World"
output:string
output:string
Why does my loop work correctly only when I'm not resetting the tok variable? I have to reset the variable; otherwise other parts of my program won't work. Is there an alternative way to reset my tok variable?
It is obvious that tok is a std::string, so:
for(unsigned int i=0; i < data_str.length(); i++)
{
    tok += data_str[i];
    if(tok[i] == '\n')
    {
        //cout << "NEWLINE" << endl;
        tok = "";
    }
Let's use paper and pencil and follow along with just this part of the parsing algorithm. Assume that data_str consists of the following text:
"hello\nworld"
After data_str[5] gets appended to tok, tok contains "hello\n". Since tok[5] is '\n', tok gets cleared to an empty string.
On the next iteration, data_str[6] gets appended to the now-empty tok, so tok contains just "w" (since it was cleared on the previous iteration of the loop).
if(tok[i] == '\n')
i is now 6, so this checks tok[6]. But tok has only one character. Indexing past the end of the string like this is undefined behavior, and the result is meaningless.
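One way to see the mismatch for yourself is to use the bounds-checked accessor at() instead of operator[]. Here is a small sketch; the helper name index_in_range is made up for illustration and is not part of the original program. Unlike operator[], at() is specified to throw std::out_of_range when the index is not a valid position, so the bad access becomes a catchable error instead of undefined behavior.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: reports whether i is a valid index into s,
// using the bounds-checked accessor at() rather than operator[].
bool index_in_range(const std::string& s, std::size_t i)
{
    try {
        (void)s.at(i);   // at() throws std::out_of_range when i >= s.size()
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}
```

With tok holding "hello\n", index 5 is in range. With tok holding just "w", index 6 is not, which is exactly the situation the loop ends up in after the first reset.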
Things go pretty much off the rails, from this point forward.
If the intent here is to clear the tok buffer after every newline, check the last character of tok, which is tok[tok.size()-1], instead of tok[i]. Once tok has been cleared, i and the size of tok have nothing to do with each other.
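Putting that together, here is a rough sketch of the loop with the check moved to the last character of tok. It assumes C++11 or later, where tok.back() is equivalent to tok[tok.size()-1]. The function name find_output_tokens and the idea of returning the vector are my own framing, not something from the original program.

```cpp
#include <string>
#include <vector>

// Hypothetical wrapper around the loop from the question: collects an
// "output:" entry each time that marker is seen at the start of a token.
std::vector<std::string> find_output_tokens(const std::string& data_str)
{
    std::vector<std::string> tokens;
    std::string tok;

    for (std::string::size_type i = 0; i < data_str.length(); i++)
    {
        tok += data_str[i];

        // Look at the character just appended. tok is never empty here,
        // so back() is always valid, no matter how often tok was reset.
        if (tok.back() == '\n')
        {
            tok.clear();
        }
        else if (tok == "output:")
        {
            tokens.push_back(tok);
            tok.clear();
        }
    }
    return tokens;
}
```

For input with two lines that each begin with output:, this should return two entries. Note that it keeps the original design: "output:" is only recognized when tok was reset immediately before it, so a marker in the middle of a line is still missed.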