I have written the following simple MRE that reproduces a bug in my program:
#include <iostream>
#include <utility>
#include <sstream>
#include <string_view>
#include <array>
#include <vector>
#include <iterator>
// this function is working fine only if string_view contains all the user provided chars and nothing extra like null bytes
std::pair< bool, std::vector< std::string > > tokenize( const std::string_view inputStr, const std::size_t expectedTokenCount )
{
// unnecessary implementation details
std::stringstream ss;
ss << inputStr.data( ); // works for null-terminated strings, but not for the non-null terminated strings
// unnecessary implementation details
}
int main( )
{
constexpr std::size_t REQUIRED_TOKENS_COUNT { 3 };
std::array<char, 50> input_buffer { };
std::cin.getline( input_buffer.data( ), input_buffer.size( ) ); // reads at most 49 characters; the last slot holds the terminating '\0'
const auto [ hasExpectedTokenCount, foundTokens ] { tokenize( { input_buffer.data( ), input_buffer.size( ) }, REQUIRED_TOKENS_COUNT ) };
for ( const auto& token : foundTokens ) // print the tokens
{
std::cout << '\'' << token << "' ";
}
std::cout << '\n';
}
This is a program for tokenization (for full code see Compiler Explorer at the link below). Also, I use GCC v11.2.
First of all, I want to avoid using data() since it's a bit less efficient. I looked at the assembly in Compiler Explorer and apparently, data() calls strlen() so when it reaches the first null byte it stops. But what if the string_view object is not null-terminated? That's a bit concerning. So I switched to ss << inputStr;.
Secondly, when I do this ss << inputStr;, the whole 50-character buffer is inserted into ss with all of its null bytes. Below are some sample outputs that are wrong:
sample #1:
1 2 3
'1' '2' '3 ' // '1' and '2' are correct, '3' has lots of null bytes
sample #2 (in this one I typed a space character after 3):
1 2 3
'1' '2' '3' ' ' // an extra token consisting of 1 space char and lots of null bytes has been created!
Is there a way to fix this? What should I do now to also support non-null-terminated strings? I came up with the idea of using gcount() as below:
const std::streamsize charCount { std::cin.gcount( ) };
// here I pass charCount instead of the size of buffer
const auto [ hasExpectedTokenCount, foundTokens ] { tokenize( { input_buffer.data( ), charCount },
REQUIRED_TOKENS_COUNT ) };
But the problem is that when the user enters fewer characters than the buffer size, gcount() returns a value that is 1 more than the actual number of entered chars (e.g. the user enters 5 characters but gcount returns 6, apparently also taking '\0' into account). This causes the last token to also have a null byte at its end:
1 2 3
'1' '2' '3 ' // see the null byte in '3 ', it's NOT a space char
How should I fix gcount's inconsistent output? Or maybe I should change the function tokenize so that it gets rid of any '\0' at the end of the string_view and then starts to tokenize it.
It might sound like an XY problem though. But I really need help to decide what to do.
The basic problem you have is with the operator<< functions. You've tried two of them:
operator<<(ostream &, const char *), which takes characters from the pointer up to (and not including) the next NUL. As you've noted, that may be a problem if the pointer comes from a string_view without a terminating NUL.
operator<<(ostream &, const string_view &), which takes all the characters from the string_view, including any NULs that may be present.
It seems that what you want to do is take characters from the string_view up to (and not including) the first NUL or the end of the string_view, whichever comes first. You can do that with find and constructing a substr up to the NUL or end:
ss << inputStr.substr(0, inputStr.find('\0'));
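To see the effect in isolation, here is a minimal sketch of that one-liner wrapped in a helper. The names trimAtFirstNul and splitOnWhitespace are hypothetical and only stand in for the elided body of tokenize; the whitespace splitting is an assumption about what that body does, not the original implementation.

```cpp
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: returns the part of sv before the first '\0'.
// If sv contains no '\0', find() returns npos and substr() yields the whole view.
std::string_view trimAtFirstNul( const std::string_view sv )
{
    return sv.substr( 0, sv.find( '\0' ) );
}

// Hypothetical stand-in for the elided tokenize() body: splits on whitespace.
std::vector< std::string > splitOnWhitespace( const std::string_view inputStr )
{
    std::stringstream ss;
    ss << trimAtFirstNul( inputStr ); // inserts no NUL bytes, never reads past the view

    std::vector< std::string > tokens;
    for ( std::string token; ss >> token; )
    {
        tokens.push_back( token );
    }
    return tokens;
}
```

Because the cut happens at the first '\0' or at the end of the view, the same call should work whether you pass the whole 50-byte buffer, a view sized with gcount() that is one character too long, or a view that has no terminator at all.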