Tags: c++, boost, boost-tokenizer

Removing duplicates from Boost::Tokenizer?


I am trying to split a comma-separated string and then perform some action on each token while ignoring duplicates, something along the following lines:

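#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int, char**)
{
  string text = "token, test   string";

  // split on commas and spaces
  char_separator<char> sep(", ");
  tokenizer< char_separator<char> > tokens(text, sep);
  // remove duplicates from tokens?
  BOOST_FOREACH (const string& t, tokens) {
    cout << t << "." << endl;
  }
}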

Is there a way to do this on the boost::tokenizer?

I know that I can solve this problem using boost::split and std::unique, but was wondering whether there is a way to achieve this with the tokenizer as well.


Solution

  • Boost.Tokenizer can do many cool things, but it cannot do this; the answer is indeed "no".

    If you're only looking to drop adjacent duplicates, Boost.Range can help make it seamless:

    #include <iostream>
    #include <string>
    #include <boost/range/adaptor/uniqued.hpp>
    #include <boost/foreach.hpp>
    #include <boost/tokenizer.hpp>
    
    using namespace boost;
    using namespace boost::adaptors;
    int main()
    {
        std::string text = "token, test   string test, test   test";
    
        char_separator<char> sep(", ");
        tokenizer< char_separator<char> > tokens(text, sep);
        BOOST_FOREACH (const std::string& t, tokens | uniqued ) {
            std::cout << t << "." << '\n';
        }
    }
    

    This prints:

    token.
    test.
    string.
    test.
    

    To act only on globally unique tokens, you will need to store state, one way or another. The simplest solution is probably an intermediate set (note that std::set iterates in sorted order, not in the original token order):

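    // Requires #include <set> in addition to the headers above.
    char_separator<char> sep(", ");
    tokenizer< char_separator<char> > tokens(text, sep);
    // Copy the tokens into a set; duplicates are discarded on insertion.
    std::set<std::string> unique_tokens(tokens.begin(), tokens.end());
    BOOST_FOREACH (const std::string& t, unique_tokens) {
        std::cout << t << "." << '\n';
    }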