boost::split pushes an empty string to the vector even with token_compress_on


When the input string is empty, boost::split returns a vector containing one empty string.

Is it possible to have boost::split return an empty vector instead?

MCVE:

#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>

int main() {
    std::vector<std::string> result;
    boost::split(result, "", boost::is_any_of(","), boost::algorithm::token_compress_on);
    std::cout << result.size();
}

Output:

1

Desired output:

0

Solution

  • Compression merges adjacent delimiters into one; it does not suppress empty tokens.

    Consider the following test, which shows that the behavior is consistent:

    #include <boost/algorithm/string.hpp>
    #include <string>
    #include <iostream>
    #include <iomanip>
    #include <vector>
    
    int main() {
        for (std::string const& test : {
                "", "token", 
                ",", "token,", ",token", 
                ",,", ",token,", ",,token", "token,,"
            })
        {
            std::vector<std::string> result;
            boost::split(result, test, boost::is_any_of(","), boost::algorithm::token_compress_on);
            std::cout << "\n=== TEST: " << std::left << std::setw(8) << test << " === ";
            for (auto& tok : result)
                std::cout << std::quoted(tok, '\'') << " ";
        }
    }
    

    Prints

    === TEST:          === '' 
    === TEST: token    === 'token' 
    === TEST: ,        === '' '' 
    === TEST: token,   === 'token' '' 
    === TEST: ,token   === '' 'token' 
    === TEST: ,,       === '' '' 
    === TEST: ,token,  === '' 'token' '' 
    === TEST: ,,token  === '' 'token' 
    === TEST: token,,  === 'token' '' 
    

    With compression on, empty tokens can only appear at the start or end of the result. So you can fix it by trimming delimiters from both ends and splitting only if the remaining input is non-empty:

    #include <boost/algorithm/string.hpp>
    #include <string>
    #include <iostream>
    #include <iomanip>
    #include <vector>
    
    int main() {
        auto const delim = boost::is_any_of(",");
    
        for (std::string test : {
                "", "token", 
                ",", "token,", ",token", 
                ",,", ",token,", ",,token", "token,,"
            })
        {
            std::cout << "\n=== TEST: " << std::left << std::setw(8) << test << " === ";
    
            std::vector<std::string> result;
    
            boost::trim_if(test, delim);
            if (!test.empty())
                boost::split(result, test, delim, boost::algorithm::token_compress_on);
    
            for (auto& tok : result)
                std::cout << std::quoted(tok, '\'') << " ";
        }
    }
    

    Printing:

    === TEST:          === 
    === TEST: token    === 'token' 
    === TEST: ,        === 
    === TEST: token,   === 'token' 
    === TEST: ,token   === 'token' 
    === TEST: ,,       === 
    === TEST: ,token,  === 'token' 
    === TEST: ,,token  === 'token' 
    === TEST: token,,  === 'token' 
    

    BONUS: Boost Spirit

    Using Spirit X3 seems to me to be more flexible and potentially more efficient:

    #include <boost/spirit/home/x3.hpp>
    #include <string>
    #include <iostream>
    #include <iomanip>
    #include <vector>
    
    int main() {
        static auto const delim = boost::spirit::x3::char_(",");
    
        for (std::string test : {
                "", "token", 
                ",", "token,", ",token", 
                ",,", ",token,", ",,token", "token,,"
            })
        {
            std::cout << "\n=== TEST: " << std::left << std::setw(8) << test << " === ";
    
            std::vector<std::string> result;
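            // Grammar: optional runs of non-delimiter characters, separated by delimiters.
            // The unqualified call finds boost::spirit::x3::parse via argument-dependent lookup.
            parse(test.begin(), test.end(), -(+~delim) % delim, result);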
    
            for (auto& tok : result)
                std::cout << std::quoted(tok, '\'') << " ";
        }
    }