I need to extract file names, including their extensions, from an input text file into a vector of strings. The input file is quite messy and serves as a configuration file for some application.
What I know about the file names I am trying to extract is that they are preceded by a 'file =' marker and that the name is quoted with either ' ' or " ". Example: file="name.abc". I have no guarantee about the spacing: it may be file="name.abc", file = "name.abc", file= "name.abc"... The extension can also be of different lengths.
So I tried the following code:
std::vector<std::string> attachment_names;
std::istringstream words(text_content);
std::string word;
std::string abc_extension(".abc"); // My code should support any extension
while (words >> word)
{
    auto extension_found = word.find(abc_extension);
    if (extension_found != word.npos)
    {
        auto name_start = word.find("'") + 1;
        // I am not even sure the file is quoted by ''
        std::string attachment_name = word.substr(name_start, (extension_found + 3) - name_start + 1);
        // Doing this annoys me a bit... Especially since the extension may be longer than 3 characters
        attachment_names.push_back(attachment_name);
    }
}
Is there a nicer way of doing this? Is it possible to rely more on the 'file =' prefix so that any extension is supported?
With C++11 or Boost, my recommendation is to use a regular expression with a regex iterator for this problem, since the amount of whitespace varies and hand-written parsing is going to get messy.
A sregex_iterator traverses the text and yields each regex match (any bidirectional iterator range can serve as the source, for example a string read with getline). An untested idea follows:
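// Matches file="name.ext" or file = 'name.ext' with arbitrary spacing; group 1 is the file name.
static std::regex const filename_re("file[[:space:]]*=[[:space:]]*[\"']([^\"']*)[\"']");
std::sregex_iterator rit(line.begin(), line.end(), filename_re), end;
while (rit != end) {
    std::cout << (*rit)[1] << ',';
    ++rit;
}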
Run this for each line of the input: every match found on the line is printed, since the capture group holds the file name.
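To collect the names into a vector, as the question asks, the same idea can be wrapped in a function. The following is only a sketch: the helper name `extract_file_names` is hypothetical, it assumes the whole configuration text is already in a `std::string`, and it assumes the closing quote is the same character as the opening one. Nothing in it depends on the extension, so any length is accepted.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): returns every file name
// that follows a `file =` prefix and is wrapped in single or double quotes.
std::vector<std::string> extract_file_names(const std::string& text)
{
    // \bfile   : the word "file" (so "profile" does not match)
    // \s*=\s*  : '=' with optional whitespace on either side
    // (["'])   : group 1, the opening quote
    // (.*?)    : group 2, the file name, as short as possible
    // \1       : the same quote character that opened the name
    static const std::regex filename_re(R"re(\bfile\s*=\s*(["'])(.*?)\1)re");

    std::vector<std::string> names;
    for (std::sregex_iterator it(text.begin(), text.end(), filename_re), end;
         it != end; ++it)
    {
        names.push_back((*it)[2].str()); // group 2 is the name without quotes
    }
    return names;
}
```

Calling `extract_file_names(text_content)` with the question's `text_content` would give the equivalent of `attachment_names`. The back-reference `\1` is a design choice: it lets a name such as "it's.txt" contain the other kind of quote, at the cost of rejecting entries whose quotes do not match.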