How to tokenize special characters depending on whitespace (< > | & etc.)


I found a project from a few years ago that does some simple command-line parsing. While I really like its functionality, it does not support parsing special characters such as <, >, and &. I tried to add support for these characters by adding the same kind of conditions that the existing code uses to detect whitespace, escape characters, and quotes:

bool _isQuote(char c) {
    if (c == '\"')
            return true;
    else if (c == '\'')
            return true;

    return false;
}

bool _isEscape(char c) {
    if (c == '\\')
        return true;

    return false;
}

bool _isWhitespace(char c) {
    if (c == ' ')
        return true;
    else if(c == '\t')
        return true;

    return false;
}
.
.
.

What I added:

bool _isLeftCarrot(char c) {
    if (c == '<')
        return true;

    return false;
}

bool _isRightCarrot(char c) {
    if (c == '>')
        return true;

    return false;
}

and so on for the rest of the special characters.

I also tried the same approach as the existing code in the parse method:

std::list<std::string> parse(const std::string& args) {

    std::stringstream ain(args);            // iterates over the input string
    ain >> std::noskipws;                   // ensures not to skip whitespace
    std::list<std::string> oargs;           // list of strings where we will store the tokens

    std::stringstream currentArg("");
    currentArg >> std::noskipws;

    // current state
    enum State {
            InArg,          // scanning the string currently
            InArgQuote,     // scanning the string that started with a quote currently 
            OutOfArg        // not scanning the string currently
    };
    State currentState = OutOfArg;

    char currentQuoteChar = '\0';   // used to differentiate between ' and "
                                    // ex. "sample'text" 

    char c;
    std::stringstream ss;
    std::string s;
    // iterate character by character through input string
    while(!ain.eof() && (ain >> c)) {

            // if current character is a quote
            if(_isQuote(c)) {
                    switch(currentState) {
                            case OutOfArg:
                                    currentArg.str(std::string());
                            case InArg:
                                    currentState = InArgQuote;
                                    currentQuoteChar = c;
                                    break;
                            case InArgQuote:
                                    if (c == currentQuoteChar)
                                            currentState = InArg;
                                    else
                                            currentArg << c;
                                    break;
                    }
            }
            // if current character is whitespace
            else if (_isWhitespace(c)) {
                        switch(currentState) {
                            case InArg:
                                    oargs.push_back(currentArg.str());
                                    currentState = OutOfArg;
                                    break;
                            case InArgQuote:
                                    currentArg << c;
                                    break;
                            case OutOfArg:
                                    // nothing
                                    break;
                    }
            }
            // if current character is escape character
            else if (_isEscape(c)) {
                    switch(currentState) {
                            case OutOfArg:
                                    currentArg.str(std::string());
                                    currentState = InArg;
                            case InArg:
                            case InArgQuote:
                                    if (ain.eof())
                                    {
                                            currentArg << c;
                                            throw(std::runtime_error("Found Escape Character at end of file."));
                                    }
                                    else {
                                            char c1 = c;
                                            ain >> c;
                                            if (c != '\"')
                                                    currentArg << c1;
                                            ain.unget();
                                            ain >> c;
                                            currentArg << c;
                                    }
                                    break;
                    }
            }

What I added in the parse method:

            // if current character is left carrot (<)
            else if(_isLeftCarrot(c)) {
                    // convert from char to string and push onto list
                    ss << c;
                    ss >> s;
                    oargs.push_back(s);
            }
            // if current character is right carrot (>)
            else if(_isRightCarrot(c)) {
                    ss << c;
                    ss >> s;
                    oargs.push_back(s);
            }
.
.
.
            else {
                    switch(currentState) {
                            case InArg:
                            case InArgQuote:
                                    currentArg << c;
                                    break;
                            case OutOfArg:
                                    currentArg.str(std::string());
                                    currentArg << c;
                                    currentState = InArg;
                                    break;
                    }
            }
    }

    if (currentState == InArg) {
            oargs.push_back(currentArg.str());
            s.clear();
    }
    else if (currentState == InArgQuote)
            throw(std::runtime_error("Starting quote has no ending quote."));

    return oargs;
}

parse returns the tokens as a list of strings.

However, I am running into issues with a specific test case when the special character is attached to the end of the input. For example, the input

foo-bar&

will return this list: [{&},{foo-bar}] instead of what I want: [{foo-bar},{&}]

I'm struggling to fix this issue. I am new to C++, so any advice along with some explanation would be a great help.


Solution

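  • When you handle one of your special characters, you need to do the same sort of thing the original code does when it encounters whitespace: look at currentState, and if you are in the middle of an argument (InArg), push the current argument onto the list and reset the state to OutOfArg before pushing the special character. As written, the special character is pushed immediately while the pending argument stays in currentArg and is only pushed at the end of parse, which is why foo-bar& comes back as [{&},{foo-bar}]. If you are inside a quoted argument (InArgQuote), you probably want to append the character to currentArg like any other character instead of splitting on it.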
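
The idea above can be sketched as a small standalone tokenizer. This is a simplified illustration, not the original project's code: escape handling is left out, the names tokenize and isSpecial are hypothetical, and the set of special characters (<, >, |, &) is an assumption taken from the question title. It also builds each one-character token with std::string(1, c) rather than going through a shared stringstream, since a stringstream that has hit end-of-file on one extraction may refuse later inserts.

```cpp
#include <list>
#include <stdexcept>
#include <string>

// Hypothetical helper: characters that should always become their own token.
bool isSpecial(char c) {
    return c == '<' || c == '>' || c == '|' || c == '&';
}

bool isWhitespace(char c) { return c == ' ' || c == '\t'; }
bool isQuote(char c) { return c == '"' || c == '\''; }

// Simplified tokenizer (escape characters are not handled in this sketch).
std::list<std::string> tokenize(const std::string& input) {
    enum State { InArg, InArgQuote, OutOfArg };
    State state = OutOfArg;
    char quoteChar = '\0';          // which quote opened the current quoted run
    std::string current;            // argument being built
    std::list<std::string> tokens;

    for (char c : input) {
        if (state == InArgQuote) {
            // Inside quotes everything is literal, including special characters.
            if (c == quoteChar)
                state = InArg;
            else
                current += c;
        } else if (isQuote(c)) {
            state = InArgQuote;
            quoteChar = c;
        } else if (isWhitespace(c)) {
            if (state == InArg) {   // whitespace ends the pending argument
                tokens.push_back(current);
                current.clear();
                state = OutOfArg;
            }
        } else if (isSpecial(c)) {
            if (state == InArg) {   // same flush as for whitespace...
                tokens.push_back(current);
                current.clear();
                state = OutOfArg;
            }
            tokens.push_back(std::string(1, c));  // ...then the special token
        } else {
            current += c;
            state = InArg;
        }
    }

    if (state == InArgQuote)
        throw std::runtime_error("Starting quote has no ending quote.");
    if (state == InArg)
        tokens.push_back(current);
    return tokens;
}
```

With this sketch, tokenize("foo-bar&") yields foo-bar followed by &, and tokenize("sort>out") yields sort, >, out. In the original parse method the equivalent change would be to add a switch on currentState inside each special-character branch, mirroring the whitespace branch, before pushing the character itself.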