
C++: Parsing a string of numbers with parentheses in it


This seems trivial but I can't seem to get around this. I have STL strings of the format 2013 336 (02 DEC) 04 (where 04 is the hour, but that's irrelevant). I'd like to extract the day of the month (02 in the example) and the month as well as the hour.

I'm trying to do this cleanly and avoid e.g. splitting the string at the parentheses and then working with substrings etc. Ideally I'd like to use a stringstream and just redirect it to variables. The code I've got right now is:

int year, dayOfYear, day, hour;
std::string month, leftParenthesis, rightParenthesis;
std::string ExampleString = "2013 336 (02 DEC) 04";

std::istringstream yearDayMonthHourStringStream( ExampleString );
yearDayMonthHourStringStream >> year >> dayOfYear >> leftParenthesis >> day >> month >> rightParenthesis >> hour;

It extracts year and dayOfYear correctly as 2013 and 336, but then things go wrong: day is 0, month is an empty string, and hour is 843076624.

leftParenthesis ends up as (02, so it contains the day. But when I omit the leftParenthesis variable and extract from the yearDayMonthHourStringStream stream directly into day, day is also 0.

Any ideas on how to deal with this? I don't know regular expressions (yet) and, admittedly, not sure if I can afford to learn them right now (timewise).

EDIT: OK, I've got it. Although this is like the billionth time I could have made my life so much easier with regex, so I guess it's time to learn it. Anyway, what worked was:

int year, dayOfYear, day, month, hour, minute, revolution;
std::string dayString, monthString;

yearDayMonthHourStringStream >> year >> dayOfYear >> dayString >> monthString >> hour;
std::string::size_type sz;
day = std::stoi( dayString.substr( dayString.find("(")+1 ), &sz ); // Convert day to an int using C++11 std::stoi. Skip the ( that may be at the beginning.

This still requires handling of monthString, but I need to change it to a number anyway, so that isn't a huge disadvantage. Not the best thing you can do (regex) but works and isn't too dirty. To my knowledge also vaguely portable and hopefully won't stop working with new compilers. But thanks everyone.


Solution

  • The obvious solution is to use regular expressions (either std::regex, in C++11, or boost::regex pre C++11). Just capture the groups you're interested in, and use std::istringstream to convert them if necessary. In this case,

    std::regex re( "\\s*\\d+\\s+\\d+\\s*\\((\\d+)\\s+([[:alpha:]]+)\\)\\s*(\\d+)" );
    

    Should do the trick.
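
    As a minimal sketch of how such a pattern might be applied: the DateFields struct and the parseDateLine function below are hypothetical names, not part of the original code. The sketch converts the captured groups with std::stoi rather than std::istringstream, and appends a trailing \\s* to the pattern so that trailing whitespace is tolerated; both are assumptions made for brevity.

```cpp
#include <regex>
#include <string>

// Hypothetical holder for the three fields of interest.
struct DateFields
{
    int day;
    std::string month;
    int hour;
};

// Returns true and fills `out` if `line` looks like
// "2013 336 (02 DEC) 04"; returns false otherwise.
// Note: std::stoi throws std::out_of_range for absurdly long
// digit sequences; this sketch does not guard against that.
bool parseDateLine( std::string const& line, DateFields& out )
{
    static std::regex const re(
        "\\s*\\d+\\s+\\d+\\s*\\((\\d+)\\s+([[:alpha:]]+)\\)\\s*(\\d+)\\s*" );
    std::smatch match;
    if ( !std::regex_match( line, match, re ) ) {
        return false;
    }
    out.day = std::stoi( match[1].str() );    // first capture: day of month
    out.month = match[2].str();               // second capture: month name
    out.hour = std::stoi( match[3].str() );   // third capture: hour
    return true;
}
```

    std::regex_match requires the whole line to match, so a line without the parentheses is rejected rather than partially parsed.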

    And regular expressions are really quite simple; it will take you less time to learn them than to implement any alternative solution.

    For an alternative solution, you'd probably want to read the line character by character, breaking it into tokens. Something along these lines:

    std::vector<std::string> tokens;
    std::string currentToken;
    char ch;
    while ( source.get(ch) && ch != '\n' ) {
        if ( std::isspace( static_cast<unsigned char>( ch ) ) ) {
            if ( !currentToken.empty() ) {
                tokens.push_back( currentToken );
                currentToken = "";
            }
        } else if ( std::ispunct( static_cast<unsigned char>( ch ) ) ) {
            if ( !currentToken.empty() ) {
                tokens.push_back( currentToken );
                currentToken = "";
            }
            //  A single punctuation character is a token by itself.
            tokens.push_back( std::string( 1, ch ) );
        } else if ( std::isalnum( static_cast<unsigned char>( ch ) ) ) {
            currentToken.push_back( ch );
        } else {
            //  Error: illegal character in line.  You'll probably
            //  want to throw an exception.
        }
    }
    if ( !currentToken.empty() ) {
        tokens.push_back( currentToken );
    }
    

    In this case, a sequence of alphanumeric characters is one token, as is any single punctuation character. You could go further, ensuring that a token is either all alpha, or all digits, and maybe regrouping sequences of punctuation, but this seems sufficient for your problem.

    Once you've got the list of tokens, you can do any necessary verifications (parentheses in the right places, etc.), and convert the tokens you're interested in, if they need converting.
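
    A rough sketch of what that verification and conversion step could look like, assuming the line was split into exactly seven tokens (year, day of year, "(", day, month, ")", hour). The names ParsedLine, isAllDigits, isAllAlpha and interpretTokens are made up for this illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct ParsedLine
{
    int day;
    std::string month;
    int hour;
};

// True if `token` is non-empty and consists solely of digits.
bool isAllDigits( std::string const& token )
{
    if ( token.empty() ) {
        return false;
    }
    for ( std::size_t i = 0; i != token.size(); ++ i ) {
        if ( !std::isdigit( static_cast<unsigned char>( token[i] ) ) ) {
            return false;
        }
    }
    return true;
}

// True if `token` is non-empty and consists solely of letters.
bool isAllAlpha( std::string const& token )
{
    if ( token.empty() ) {
        return false;
    }
    for ( std::size_t i = 0; i != token.size(); ++ i ) {
        if ( !std::isalpha( static_cast<unsigned char>( token[i] ) ) ) {
            return false;
        }
    }
    return true;
}

// Checks the token layout and converts the interesting tokens.
// Returns false if anything is out of place.
bool interpretTokens( std::vector<std::string> const& tokens, ParsedLine& out )
{
    if ( tokens.size() != 7 ) {
        return false;
    }
    if ( tokens[2] != "(" || tokens[5] != ")" ) {
        return false;                       // parentheses in the wrong place
    }
    if ( !isAllDigits( tokens[0] ) || !isAllDigits( tokens[1] )
            || !isAllDigits( tokens[3] ) || !isAllDigits( tokens[6] )
            || !isAllAlpha( tokens[4] ) ) {
        return false;
    }
    out.day = std::stoi( tokens[3] );       // C++11; throws if out of int range
    out.month = tokens[4];
    out.hour = std::stoi( tokens[6] );
    return true;
}
```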

    EDIT:

    FWIW: I've been experimenting with using auto plus a lambda as a means of defining nested functions. My mind's not made up as to whether it's a good idea or not: I don't always find the results that readable. But in this case:

    auto pushToken = [&]() {
        if ( !currentToken.empty() ) {
            tokens.push_back( currentToken );
            currentToken = "";
        }
    };
    

    Define this just before the loop, then replace each of the if ( !currentToken.empty() ) blocks with a call to pushToken(). (Or you could create a data structure with tokens, currentToken and a pushToken member function. This would work even in pre-C++11.)
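
    The data structure variant might look roughly like the following. TokenCollector, its add member and the tokenize wrapper are invented names for this sketch; it also reads from a std::string rather than from a stream, purely to keep the example short. It avoids C++11 features on purpose.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper bundling the tokenizer state.
struct TokenCollector
{
    std::vector<std::string> tokens;
    std::string currentToken;

    // Moves the pending token, if any, into the list.
    void pushToken()
    {
        if ( !currentToken.empty() ) {
            tokens.push_back( currentToken );
            currentToken.clear();
        }
    }

    // Feeds one character; returns false for an illegal character.
    bool add( char ch )
    {
        unsigned char const uch = static_cast<unsigned char>( ch );
        if ( std::isspace( uch ) ) {
            pushToken();
        } else if ( std::ispunct( uch ) ) {
            pushToken();
            tokens.push_back( std::string( 1, ch ) );   // punctuation stands alone
        } else if ( std::isalnum( uch ) ) {
            currentToken.push_back( ch );
        } else {
            return false;
        }
        return true;
    }
};

// Splits one line into tokens; returns false on an illegal character.
bool tokenize( std::string const& line, std::vector<std::string>& tokens )
{
    TokenCollector collector;
    for ( std::string::size_type i = 0; i != line.size(); ++ i ) {
        if ( !collector.add( line[i] ) ) {
            return false;
        }
    }
    collector.pushToken();      // flush whatever is still pending
    tokens = collector.tokens;
    return true;
}
```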

    EDIT:

    One final remark, since the OP seems to want to do this exclusively with std::istream: the solution there would be to add a MustMatch manipulator:

    class MustMatch
    {
        char m_toMatch;
    public:
        MustMatch( char toMatch ) : m_toMatch( toMatch ) {}
        friend std::istream& operator>>( std::istream& source, MustMatch const& manip )
        {
            char next;
            source >> next;
            //  or source.get( next ) if you don't want to skip whitespace.
            if ( source && next != manip.m_toMatch ) {
                source.setstate( std::ios_base::failbit );
            }
            return source;
        }
    };
    

    As @Angew has pointed out, you'd also need a >> for the months; typically, months would be represented as a class, so you'd overload >> on this:

    std::istream& operator>>( std::istream& source, Month& object )
    {
        //      The sentry takes care of skipping whitespace, etc.
        std::istream::sentry guard( source );
        if ( guard ) {
            std::streambuf* sb = source.rdbuf();
            std::string monthName;
            while ( std::isalpha( sb->sgetc() ) ) {
                monthName += sb->sbumpc();
            }
            if ( !isLegalMonthName( monthName ) ) {
                source.setstate( std::ios_base::failbit );
            } else {
                object = Month( monthName );
            }
        }
        return source;
    }
    

    You could, of course, introduce many variants here: the month name could be limited to a maximum of 3 characters, for example (by making the loop condition monthName.size() < 3 && std::isalpha( sb->sgetc() )). But if you're dealing with months in any way in your code, writing a Month class and its >> and << operators is something you'll have to do sooner or later anyway.

    Then something like:

    source >> year >> dayOfYear >> MustMatch( '(' ) >> day >> month
           >> MustMatch( ')' ) >> hour;
    if ( !(source >> std::ws) || source.get() != EOF ) {
        //  Format error...
    }
    

    is all that is needed. (The use of manipulators like this is another technique worth learning.)