
Parse only specific columns from csv file using token


If I have a file filled with comma separated values, such as this:

"myComputer",5,192.168.1.0,25
"herComputer",6,192.168.1.1,26
"hisComputer",7,192.168.1.2,27

And I want to pull the data out as a string, I would do something like this:

std::string line;
std::ifstream myfile ("myCSVFile.txt");

if(myfile.is_open())
{
    while(getline(myfile,line))
    {
        std::string tempString = line;
        std::string delimiter = ",";
    }
}

To parse out each value by itself, I use something like the approach from "Parse (split) a string in C++ using string delimiter (standard C++)":

std::string s = "scott>=tiger>=mushroom";
std::string delimiter = ">=";

size_t pos = 0;
std::string token;
while ((pos = s.find(delimiter)) != std::string::npos) {
    token = s.substr(0, pos);
    std::cout << token << std::endl;
    s.erase(0, pos + delimiter.length());
}
std::cout << s << std::endl;

The question is: what if I only want the first and third values? So if I wanted my CSV file from above to only output

"myComputer" 192.168.1.0
"herComputer" 192.168.1.1
"hisComputer" 192.168.1.2

Is there a way to achieve this using the methods above, or should I use a completely different method? Thanks,


Solution

  • It's much easier to use a dedicated library for this task. With Boost Tokenizer's Escaped List Separator, it's a breeze:

    #include <vector>
    #include <string>
    #include <iostream>
    #include <fstream>
    #include <boost/tokenizer.hpp>
    
    int main()
    {
        std::ifstream myfile("myCSVFile.txt");
    
        if (myfile.is_open())
        {
            std::string line;
            while (std::getline(myfile, line))
            {
                typedef boost::escaped_list_separator<char> Separator;
                typedef boost::tokenizer<Separator> Tokenizer;
    
                std::vector<std::string> tokens;
                Tokenizer tokenizer(line);
                for (Tokenizer::iterator iter = tokenizer.begin(); iter != tokenizer.end(); ++iter)
                {
                   tokens.push_back(*iter);
                }
    
                if (tokens.size() == 4)
                {
                    std::cout << tokens[0] << "\t" << tokens[2] << "\n";
                }
                else
                {
                    std::cerr << "illegal line\n";
                }
            }
        }
    }
    

    Note that in C++11, you can simplify the loop:

    for (const auto &token : tokenizer)
    {
        tokens.push_back(token);
    }          
    

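    As you can see, the idea is to store all values of a line in a std::vector and then output only what's required. Note that escaped_list_separator treats the double quotes as quoting characters and strips them, so the first value is printed as myComputer rather than "myComputer".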

    Storing every value of every line may become a performance problem if you deal with really huge files. In that case, use a counter together with the tokenizer and stop once the third value has been read:

    #include <vector>
    #include <string>
    #include <iostream>
    #include <fstream>
    #include <boost/tokenizer.hpp>
    
    int main()
    {
        std::ifstream myfile("myCSVFile.txt");
    
        if (myfile.is_open())
        {
            std::string line;
            while (std::getline(myfile, line))
            {
                typedef boost::escaped_list_separator<char> Separator;
                typedef boost::tokenizer<Separator> Tokenizer;
    
                Tokenizer tokenizer(line);
                int count = 0;
                for (Tokenizer::iterator iter = tokenizer.begin(); (iter != tokenizer.end()) && (count < 3); ++iter)
                {
                    if ((count == 0) || (count == 2))
                    {
                        std::cout << *iter;
                        if (count == 0)
                        {
                            std::cout << "\t";
                        }
                    }
                    ++count;
                }
                std::cout << "\n";
            }
        }
    }
    

    You can use both techniques (std::vector<std::string> with later output or loop with counter) even with your self-made string-splitting algorithm. The basic idea is the same:

    With std::vector<std::string>:

    std::vector<std::string> tokens;
    while ((pos = s.find(delimiter)) != std::string::npos) {
        token = s.substr(0, pos);
        tokens.push_back(token);
        s.erase(0, pos + delimiter.length());
    }
    // The loop stops at the last delimiter, so the final value is still in s.
    tokens.push_back(s);
    
    if (tokens.size() == 4)
    {
        std::cout << tokens[0] << "\t" << tokens[2] << "\n";
    }
    else
    {
        std::cerr << "illegal line\n";
    }
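
The std::vector approach can also be wrapped in a small function. The following is only a sketch: the name `split` is hypothetical, it assumes a non-empty delimiter, and it does no CSV quote handling (a comma inside a quoted value would still split the value). Unlike the loop it is modelled on, it searches from a moving offset instead of erasing from the string, and it keeps the text after the last delimiter as the final token.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits `s` on `delimiter` and returns every value,
// including the text after the last delimiter.
// Assumption: `delimiter` is not empty (an empty one would never advance).
std::vector<std::string> split(const std::string& s, const std::string& delimiter)
{
    assert(!delimiter.empty());

    std::vector<std::string> tokens;
    std::size_t start = 0;
    std::size_t pos = 0;
    while ((pos = s.find(delimiter, start)) != std::string::npos) {
        tokens.push_back(s.substr(start, pos - start));
        start = pos + delimiter.length();
    }
    // The last value has no trailing delimiter, so add it separately.
    tokens.push_back(s.substr(start));
    return tokens;
}
```

With the first sample line, `split(line, ",")` should return four tokens, and `tokens[0]` and `tokens[2]` are the two values to print.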
    

    With a counter:

    int count = 0;
    // Stop after the third value; the rest of the line is not needed.
    while ((count < 3) && (pos = s.find(delimiter)) != std::string::npos) {
        token = s.substr(0, pos);
    
        if ((count == 0) || (count == 2))
        {
            std::cout << token;
            if (count == 0)
            {
                std::cout << "\t";
            }
        }
        ++count;
        s.erase(0, pos + delimiter.length());
    }
    std::cout << "\n";