strtok how to also include delimiters as tokens


Right now my code splits a string into tokens using the delimiters , ; = and space. I would also like the delimiter characters themselves to be included as tokens.

char * cstr = new char [str.length()+1];
strcpy (cstr, str.c_str());

char * p = strtok (cstr," ,;=");

while (p!=0)
{
    whichType(p);
    p = strtok(NULL," ,;=");
}

So right now, if I print the tokens of a string such as asd sdf qwe wer,sdf;wer, I get

asd
sdf
qwe
wer
sdf
wer

I want it to look like

asd
sdf
qwe
wer
,
sdf
;
wer

Any help would be great. Thanks


Solution

  • You need more flexibility than strtok offers. (Besides, strtok is an error-prone interface: it modifies the buffer it scans and keeps hidden static state between calls.)
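
    To make that remark about strtok concrete, here is a minimal sketch of what it does to its working buffer. The helper name split_with_strtok and its buffer_after parameter are made up for this illustration; they are not part of the question or of the algorithm that follows.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Splits a copy of `text` with std::strtok and returns the tokens.
// `buffer_after` receives the raw bytes of the working buffer afterwards,
// which shows that strtok overwrote every delimiter it consumed with '\0'.
std::vector<std::string> split_with_strtok(const std::string& text,
                                           const char* delims,
                                           std::string& buffer_after)
{
    // strtok needs a writable, null-terminated buffer
    std::vector<char> buf(text.begin(), text.end());
    buf.push_back('\0');

    std::vector<std::string> tokens;
    for (char* p = std::strtok(buf.data(), delims); p != nullptr;
         p = std::strtok(nullptr, delims)) // nullptr: resume from hidden static state
    {
        tokens.push_back(p);
    }

    // copy the buffer (minus the final terminator), keeping embedded '\0' bytes
    buffer_after.assign(buf.begin(), buf.end() - 1);
    return tokens;
}
```

    Splitting "a,b;c" on ",;" this way yields the tokens a, b and c, while the buffer ends up holding a, '\0', b, '\0', c. The delimiter characters have been overwritten, so the caller can no longer tell which delimiter ended each token.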

    Here's a flexible algorithm that generates tokens, copying them to an output iterator. This means you can use it to fill a container of your choice, or print it directly to an output stream (which is what I'll use as a demo).

    The behaviour is specified in option flags:

    enum tokenize_options
    {
        tokenize_skip_empty_tokens              = 1 << 0,
        tokenize_include_delimiters             = 1 << 1,
        tokenize_exclude_whitespace_delimiters  = 1 << 2,
        //
        tokenize_options_none    = 0,
        tokenize_default_options =   tokenize_skip_empty_tokens 
                                   | tokenize_exclude_whitespace_delimiters
                                   | tokenize_include_delimiters,
    };
    

    Note how I distilled an extra requirement that you hadn't named but that your sample implies: you want the delimiters output as tokens unless they're whitespace (' '). This is what the third option, tokenize_exclude_whitespace_delimiters, is for.
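
    One practical detail when choosing your own combination of flags: in C++, applying | to two enumerators of a plain enum yields an int, which does not convert back to tokenize_options implicitly. The sketch below shows one possible way around that. The operator| overload and the has_option helper are assumptions added for illustration and are not part of the original answer; the enum is repeated only so that the block compiles on its own.

```cpp
enum tokenize_options
{
    tokenize_skip_empty_tokens              = 1 << 0,
    tokenize_include_delimiters             = 1 << 1,
    tokenize_exclude_whitespace_delimiters  = 1 << 2,
    //
    tokenize_options_none    = 0,
    tokenize_default_options =   tokenize_skip_empty_tokens
                               | tokenize_exclude_whitespace_delimiters
                               | tokenize_include_delimiters,
};

// Hypothetical helper: combine two flag sets and stay in the enum type.
// The operands are cast to int first so the built-in | is used here.
inline tokenize_options operator|(tokenize_options a, tokenize_options b)
{
    return static_cast<tokenize_options>(static_cast<int>(a) | static_cast<int>(b));
}

// Hypothetical helper: true if `flag` is set in `set`.
inline bool has_option(tokenize_options set, tokenize_options flag)
{
    return (static_cast<int>(set) & static_cast<int>(flag)) != 0;
}
```

    With these in place, an expression such as tokenize_skip_empty_tokens | tokenize_include_delimiters has type tokenize_options and can be passed as the options argument without a cast.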

    Now here's the real meat:

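    #include <algorithm>  // std::find
    #include <cctype>     // std::isspace
    #include <iostream>
    #include <iterator>   // std::begin, std::end, std::ostream_iterator
    #include <sstream>
    #include <string>

    template <typename Input, typename Delimiters, typename Out>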
    Out tokenize(
            Input const& input,
            Delimiters const& delim,
            Out out,
            tokenize_options options = tokenize_default_options
            )
    {
        // decode option flags
        const bool includeDelim   = options & tokenize_include_delimiters;
        const bool excludeWsDelim = options & tokenize_exclude_whitespace_delimiters;
        const bool skipEmpty      = options & tokenize_skip_empty_tokens;
    
        using namespace std;
        string accum;
    
        for(auto it = begin(input), last = end(input); it != last; ++it)
        {
            if (find(begin(delim), end(delim), *it) == end(delim))
            {
                accum += *it;
            }
            else
            {
                // output the token (empty tokens are dropped when skipEmpty is set)
                if (!(skipEmpty && accum.empty()))
                    *out++ = accum;
    
                // output the delimiter
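                // cast to unsigned char: std::isspace is undefined for negative char values;
                // '\0' shows up when a string literal (char array incl. terminator) is passed
                bool isWhitespace = std::isspace(static_cast<unsigned char>(*it)) || (*it == '\0');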
                if (includeDelim && !(excludeWsDelim && isWhitespace))
                {
                    *out++ = std::string(1, *it); // dump the delimiter as a separate token
                }
    
                accum.clear();
            }
        }
    
        if (!accum.empty())
            *out++ = accum;
    
        return out;
    }
    

    A full demo is live on Ideone (default options) and on Coliru (no options):

    int main()
    {
        // let's print tokens to stdout
        std::ostringstream oss;
        std::ostream_iterator<std::string> out(oss, "\n"); 
    
        tokenize("asd sdf qwe wer,sdf;wer", " ;,", out/*, tokenize_options_none*/);
    
        std::cout << oss.str();
        // that's all, folks
    }
    

    Prints:

    asd
    sdf
    qwe
    wer
    ,
    sdf
    ;
    wer
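
    Because the tokens go to an output iterator, collecting them in a container instead of printing them only requires a different iterator. The following is a simplified sketch of that idea rather than the answer's full template: tokenize_keep_delims and tokens_of are hypothetical names, the input and delimiters are assumed to be std::string, and the behaviour is fixed to the default options (skip empty tokens, emit non-whitespace delimiters).

```cpp
#include <cctype>
#include <iterator>
#include <string>
#include <vector>

// Simplified restatement of the algorithm above (hypothetical name):
// always skips empty tokens and emits non-whitespace delimiters as tokens.
template <typename Out>
Out tokenize_keep_delims(const std::string& input, const std::string& delims, Out out)
{
    std::string accum;
    for (char c : input)
    {
        if (delims.find(c) == std::string::npos)
        {
            accum += c;              // ordinary character: extend the current token
            continue;
        }
        if (!accum.empty())
        {
            *out++ = accum;          // a delimiter ends the current token
            accum.clear();
        }
        if (!std::isspace(static_cast<unsigned char>(c)))
            *out++ = std::string(1, c); // emit the delimiter itself as a token
    }
    if (!accum.empty())
        *out++ = accum;              // trailing token after the last delimiter
    return out;
}

// Collects the tokens into a std::vector via std::back_inserter.
std::vector<std::string> tokens_of(const std::string& input, const std::string& delims)
{
    std::vector<std::string> result;
    tokenize_keep_delims(input, delims, std::back_inserter(result));
    return result;
}
```

    Calling tokens_of("asd sdf qwe wer,sdf;wer", " ,;=") should produce the same eight tokens as the printed output above, now stored in a vector that can be handed to something like whichType one element at a time.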