Extracting file names from text file


I need to extract file names, with their extensions, from an input text file into a string vector. The input text file is quite messy and serves as a configuration file for some application.

What I know about the file names I am trying to extract is that each one is preceded by a 'file =' mention and is quoted with either ' ' or " ". Example: file="name.abc". I also have no guarantee about the spacing: it may be file="name.abc", file = "name.abc", file= "name.abc"... The extension can also be of different lengths.

So I tried the following code:

std::vector<std::string> attachment_names;
std::istringstream words(text_content);
std::string word;
std::string abc_extension(".abc"); // My code should support any extension
while (words >> word)
{
    auto extension_found = word.find(abc_extension);
    if (extension_found != word.npos)
    {
        auto name_start = word.find("'") + 1; 
             //I am not even sure the file is quoted by ''

        std::string attachment_name = word.substr(name_start, (extension_found + 3) - name_start + 1); 
             //Doing this annoys me a bit... Especially that the extension may be longer than 3 characters

        attachment_names.push_back(attachment_name);
    }
}

Is there a nicer way of doing this? Is it possible to rely on the 'file =' prefix instead, so that any extension is supported?


Solution

  • With C++11 or Boost, I recommend using a regular expression with a regex iterator for this problem, since the amount of whitespace varies and manual parsing gets messy. A std::sregex_iterator traverses the text and yields each match of the regex (the source can be any range of bidirectional iterators, for example a string read with getline). An untested idea follows:

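    // Requires <iostream>, <regex> and <string>; `line` is a std::string read with getline.
    // Group 1 is the opening quote (' or "), group 2 is the file name between the quotes.
    static std::regex const filename_re("file[[:space:]]*=[[:space:]]*([\"'])(.*?)\\1");

    std::sregex_iterator rit(line.begin(), line.end(), filename_re), end;

    while (rit != end) {
      std::cout << (*rit)[2] << ',';
      ++rit;
    }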

    Run on each line of your input, this prints every file name found on that line, since the capture group holds the file name.
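
To tie this back to the question, here is a minimal sketch that collects the names into a std::vector<std::string> instead of printing them. The function name extract_file_names is hypothetical (it is not from the question or the answer), and the pattern assumes that the closing quote is the same character as the opening one and that a name never spans several lines. Because the name is taken as everything between the quotes, the extension length does not matter.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of the original post): returns every file
// name that follows a `file = "..."` or `file = '...'` mention in `text`.
std::vector<std::string> extract_file_names(const std::string& text)
{
    // \bfile    the literal key, on a word boundary so "profile=" is skipped
    // \s*=\s*   optional whitespace on either side of '='
    // (["'])    group 1: the opening quote character
    // (.*?)     group 2: the file name, matched lazily
    // \1        the same quote character that opened the name
    static const std::regex file_re(R"re(\bfile\s*=\s*(["'])(.*?)\1)re");

    std::vector<std::string> names;
    for (std::sregex_iterator it(text.begin(), text.end(), file_re), end;
         it != end; ++it)
    {
        names.push_back((*it)[2].str()); // keep only the text between the quotes
    }
    return names;
}
```

Calling extract_file_names(text_content) on the whole file contents from the question would fill a vector such as attachment_names in a single pass, with no need to split the text into words first.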