This seems trivial, but I can't seem to get around it. I have STL strings of the format 2013 336 (02 DEC) 04 (where 04 is the hour, but that's irrelevant). I'd like to extract the day of the month (02 in the example) and the month as well as the hour.
I'm trying to do this cleanly and avoid e.g. splitting the string at the parentheses and then working with substrings etc. Ideally I'd like to use a stringstream and just redirect it to variables. The code I've got right now is:
int year, dayOfYear, day, hour;
std::string month, leftParenthesis, rightParenthesis;
std::string ExampleString = "2013 336 (02 DEC) 04";
std::istringstream yearDayMonthHourStringStream( ExampleString );
yearDayMonthHourStringStream >> year >> dayOfYear >> leftParenthesis >> day >> month >> rightParenthesis >> hour;
It extracts the year and dayOfYear alright as 2013 and 336, but then things start going badly: day is 0, month is an empty string, and hour is 843076624. leftParenthesis is (02, so it contains the day, but when I try to omit the leftParenthesis variable while redirecting the yearDayMonthHourStringStream stream, day is also 0.
Any ideas on how to deal with this? I don't know regular expressions (yet) and, admittedly, not sure if I can afford to learn them right now (timewise).
EDIT OK, I've got it. Although this is like the billionth time when I could make my life just so much easier with regex, so I guess it's time. Anyway, what worked was:
int year, dayOfYear, day, month, hour, minute, revolution;
std::string dayString, monthString;
yearDayMonthHourStringStream >> year >> dayOfYear >> dayString >> monthString >> hour;
std::string::size_type sz;
day = std::stoi( dayString.substr( dayString.find("(")+1 ), &sz ); // Convert day to an int using C++11 std::stoi. Ignore the ( that may be at the beginning.
This still requires handling of monthString, but I need to change it to a number anyway, so that isn't a huge disadvantage. Not the best thing you can do (regex), but it works and isn't too dirty. To my knowledge it's also vaguely portable and hopefully won't stop working with new compilers. But thanks everyone.
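For the remaining monthString step, here is a minimal sketch, assuming the months are three-letter English abbreviations. With the extraction above, monthString ends up holding DEC) including the closing parenthesis, so the sketch drops every non-letter before comparing. The name monthNumber is hypothetical and not part of the code above.

```cpp
#include <array>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns 1..12 for "JAN".."DEC" (any case),
// or 0 if the text is not a known abbreviation.
// Non-letters such as a trailing ')' are ignored.
int monthNumber( std::string const& text )
{
    static std::array<char const*, 12> const names = {{
        "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
        "JUL", "AUG", "SEP", "OCT", "NOV", "DEC" }};

    // Keep only the letters, upper-cased.
    std::string key;
    for ( char ch : text ) {
        unsigned char uch = static_cast<unsigned char>( ch );
        if ( std::isalpha( uch ) ) {
            key += static_cast<char>( std::toupper( uch ) );
        }
    }

    for ( std::size_t i = 0; i != names.size(); ++i ) {
        if ( key == names[i] ) {
            return static_cast<int>( i ) + 1;
        }
    }
    return 0;
}
```

It would be called as month = monthNumber( monthString ); a result of 0 signals an unrecognised name.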
The obvious solution is to use regular expressions (either std::regex, in C++11, or boost::regex pre C++11). Just capture the groups you're interested in, and use std::istringstream to convert them if necessary. In this case,
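std::regex re( "\\s*\\d+\\s+\\d+\\s*\\((\\d+)\\s+([[:alpha:]]+)\\)\\s*(\\d+)" );
should do the trick. (Note the \\) after the month: the closing parenthesis in the input must be escaped, just like the opening one.)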
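To show how the capture groups would be used, here is a minimal sketch built around that pattern, with the literal closing parenthesis escaped as \\). The DayMonthHour struct and the parseWithRegex name are assumptions made for illustration, and the sketch converts with std::stoi rather than std::istringstream, simply to keep it short.

```cpp
#include <regex>
#include <string>

// Hypothetical result type, for illustration only.
struct DayMonthHour
{
    int day;
    std::string month;
    int hour;
};

// Returns true and fills `result` if the whole of `line` has the form
// "2013 336 (02 DEC) 04"; returns false otherwise.
// Note: std::stoi throws std::out_of_range for absurdly long digit runs.
bool parseWithRegex( std::string const& line, DayMonthHour& result )
{
    static std::regex const re(
        "\\s*\\d+\\s+\\d+\\s*\\((\\d+)\\s+([[:alpha:]]+)\\)\\s*(\\d+)" );

    std::smatch match;
    if ( !std::regex_match( line, match, re ) ) {
        return false;
    }
    result.day = std::stoi( match[1].str() );    // first group: day of month
    result.month = match[2].str();               // second group: month name
    result.hour = std::stoi( match[3].str() );   // third group: hour
    return true;
}
```

std::regex_match requires the entire string to match, so trailing garbage is rejected; std::regex_search would be the choice if the pattern may appear inside a longer line.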
And regular expressions are really quite simple; it will take you less time to learn them than to implement any alternative solution.
For an alternative solution, you'd probably want to read the line character by character, breaking it into tokens. Something along these lines (where source is the std::istream you're reading from):
    std::vector<std::string> tokens;
    std::string currentToken;
    char ch;
    while ( source.get(ch) && ch != '\n' ) {
        if ( std::isspace( static_cast<unsigned char>( ch ) ) ) {
            // Whitespace ends the current token, if any.
            if ( !currentToken.empty() ) {
                tokens.push_back( currentToken );
                currentToken = "";
            }
        } else if ( std::ispunct( static_cast<unsigned char>( ch ) ) ) {
            // Punctuation ends the current token, and is a token
            // in its own right.
            if ( !currentToken.empty() ) {
                tokens.push_back( currentToken );
                currentToken = "";
            }
            tokens.push_back( std::string( 1, ch ) );
        } else if ( std::isalnum( static_cast<unsigned char>( ch ) ) ) {
            currentToken.push_back( ch );
        } else {
            // Error: illegal character in line.  You'll probably
            // want to throw an exception.
        }
    }
    if ( !currentToken.empty() ) {
        tokens.push_back( currentToken );
    }
In this case, a sequence of alphanumeric characters is one token, as is any single punctuation character. You could go further, ensuring that a token is either all alpha, or all digits, and maybe regrouping sequences of punctuation, but this seems sufficient for your problem.
Once you've got the list of tokens, you can do any necessary verifications (parentheses in the right places, etc.), and convert the tokens you're interested in, if they need converting.
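As a rough sketch of that verification step, assuming the tokenizer produced exactly the seven tokens 2013, 336, (, 02, DEC, ), 04: the function and helper names below (extractFields, isAllDigits, isAllAlpha) are hypothetical, and the checks are only as strict as this particular format needs.

```cpp
#include <cctype>
#include <string>
#include <vector>

// True if `token` is non-empty and consists only of decimal digits.
bool isAllDigits( std::string const& token )
{
    if ( token.empty() ) return false;
    for ( char ch : token ) {
        if ( !std::isdigit( static_cast<unsigned char>( ch ) ) ) return false;
    }
    return true;
}

// True if `token` is non-empty and consists only of letters.
bool isAllAlpha( std::string const& token )
{
    if ( token.empty() ) return false;
    for ( char ch : token ) {
        if ( !std::isalpha( static_cast<unsigned char>( ch ) ) ) return false;
    }
    return true;
}

// Expects exactly: year dayOfYear ( day month ) hour
// Returns false, leaving the outputs untouched, on any mismatch.
// Note: std::stoi throws std::out_of_range for absurdly long digit runs.
bool extractFields( std::vector<std::string> const& tokens,
                    int& day, std::string& month, int& hour )
{
    if ( tokens.size() != 7
            || !isAllDigits( tokens[0] )    // year
            || !isAllDigits( tokens[1] )    // day of year
            || tokens[2] != "("
            || !isAllDigits( tokens[3] )    // day of month
            || !isAllAlpha( tokens[4] )     // month name
            || tokens[5] != ")"
            || !isAllDigits( tokens[6] ) ) { // hour
        return false;
    }
    day = std::stoi( tokens[3] );
    month = tokens[4];
    hour = std::stoi( tokens[6] );
    return true;
}
```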
EDIT:
FWIW: I've been experimenting with using auto plus a lambda as a means of defining nested functions. My mind's not made up as to whether it's a good idea or not: I don't always find the results that readable. But in this case:
    auto pushToken = [&]() {
        if ( !currentToken.empty() ) {
            tokens.push_back( currentToken );
            currentToken = "";
        }
    };
Put this just before the loop, then replace each of the if ( !currentToken.empty() ) blocks with pushToken(). (Or you could create a data structure with tokens, currentToken and a pushToken member function. This would work even in pre-C++11.)
EDIT:
One final remark, since the OP seems to want to do this exclusively with std::istream: the solution there would be to add a MustMatch manipulator:
    class MustMatch
    {
        char m_toMatch;
    public:
        MustMatch( char toMatch ) : m_toMatch( toMatch ) {}
        friend std::istream& operator>>( std::istream& source, MustMatch const& manip )
        {
            char next;
            source >> next;
            //  or source.get( next ) if you don't want to skip whitespace.
            if ( source && next != manip.m_toMatch ) {
                source.setstate( std::ios_base::failbit );
            }
            return source;
        }
    };
As @Angew has pointed out, you'd also need a >> for the months; typically, months would be represented as a class, so you'd overload >> on this:
    std::istream& operator>>( std::istream& source, Month& object )
    {
        //  The sentry takes care of skipping whitespace, etc.
        std::istream::sentry guard( source );
        if ( guard ) {
            std::streambuf* sb = source.rdbuf();
            std::string monthName;
            //  sgetc() peeks at the next character; sbumpc() consumes it.
            while ( std::isalpha( sb->sgetc() ) ) {
                monthName += static_cast<char>( sb->sbumpc() );
            }
            if ( !isLegalMonthName( monthName ) ) {
                source.setstate( std::ios_base::failbit );
            } else {
                object = Month( monthName );
            }
        }
        return source;
    }
You could, of course, introduce many variants here: the month name could be limited to a maximum of 3 characters, for example (by making the loop condition monthName.size() < 3 && std::isalpha( sb->sgetc() )). But if you're dealing with months in any way in your code, writing a Month class and its >> and << operators is something you'll have to do sooner or later anyway.
Then something like:
    source >> year >> dayOfYear >> MustMatch( '(' ) >> day >> month
           >> MustMatch( ')' ) >> hour;
    if ( !(source >> std::ws) || source.get() != EOF ) {
        //  Format error...
    }
is all that is needed. (The use of manipulators like this is another technique worth learning.)
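Putting the pieces together, a self-contained sketch could look like the following. The Month struct here is a deliberately minimal stand-in (it only stores the letters it read), and parseLine is a hypothetical wrapper name. For the trailing check, the sketch tries to read one more non-whitespace character instead of using std::ws, which is a swapped-in but equivalent way of rejecting leftover input.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Fails the stream unless the next non-whitespace character is `toMatch`.
class MustMatch
{
    char m_toMatch;
public:
    explicit MustMatch( char toMatch ) : m_toMatch( toMatch ) {}
    friend std::istream& operator>>( std::istream& source, MustMatch const& manip )
    {
        char next;
        if ( source >> next && next != manip.m_toMatch ) {
            source.setstate( std::ios_base::failbit );
        }
        return source;
    }
};

// Hypothetical minimal Month: just the name as read, no validation
// beyond "at least one letter".
struct Month
{
    std::string name;
};

std::istream& operator>>( std::istream& source, Month& object )
{
    typedef std::char_traits<char> Traits;
    std::istream::sentry guard( source );   // skips leading whitespace
    if ( guard ) {
        std::streambuf* sb = source.rdbuf();
        std::string monthName;
        // Consume letters only; the first non-letter stays in the buffer.
        while ( sb->sgetc() != Traits::eof() && std::isalpha( sb->sgetc() ) ) {
            monthName += Traits::to_char_type( sb->sbumpc() );
        }
        if ( sb->sgetc() == Traits::eof() ) {
            source.setstate( std::ios_base::eofbit );
        }
        if ( monthName.empty() ) {
            source.setstate( std::ios_base::failbit );
        } else {
            object.name = monthName;
        }
    }
    return source;
}

// Returns true if `line` has the form "2013 336 (02 DEC) 04" with
// nothing but whitespace after the hour.
bool parseLine( std::string const& line, int& day, Month& month, int& hour )
{
    std::istringstream source( line );
    int year, dayOfYear;
    source >> year >> dayOfYear >> MustMatch( '(' ) >> day >> month
           >> MustMatch( ')' ) >> hour;
    if ( !source ) {
        return false;               // some field failed to parse
    }
    char extra;
    return !( source >> extra );    // any further character is an error
}
```

A real Month class would presumably validate the name (as isLegalMonthName does above) and expose a numeric value; that part is left out here to keep the sketch short.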