c++ boost boost-regex

Extracting substrings from text files using Boost.Regex


I have email addresses scattered across many text files whose layouts are not consistent, and I need to extract them. I'm using Boost.Regex and Boost.Filesystem to read the files and pull out the addresses. However, it doesn't find or extract any email. The regex can match simple things, such as the word email or the letter a, but it seems to have trouble with the actual content read from the files.

A minimal example is as follows:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

#include <boost/regex.hpp>
#include <boost/foreach.hpp>
#include <boost/filesystem.hpp>


namespace fs = boost::filesystem;   // File system is namespace.

int main() {
    boost::regex pattern("\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b");  // Email regex to match.
    boost::smatch result;

    fs::path targetDir(boost::filesystem::current_path());  // Look in this folder.
    fs::directory_iterator it(targetDir), eod;      // Iterate over all the files in said directory.
    std::string line;
    BOOST_FOREACH(fs::path const &p, std::make_pair(it, eod)) { // Actual iteration.
        if (fs::is_regular_file(p)) {   // What this does is checks if it's a normal file. 
            std::ifstream infile(p.string());   // Read file line by line. 
            if (p.string().substr(p.string().length() - 3) != "txt") {
                continue;   // Skip to next file if not text file. 
            }
            while (std::getline(infile, line)) {
                bool isMatchFound = boost::regex_search(line, result, pattern);
                if (isMatchFound)
                {
                    for (unsigned int i = 0; i < result.size(); i++)
                    {
                        std::cout << result[i] << std::endl;
                    }
                }
            }
            infile.close();
        }    
    }
    return 0;
}

I'm not sure why it's not working. A sample of the emails:

"[email protected]","S"
"[email protected]","R"
[email protected]<br>

The emails can appear in various other forms in the text files. How do I get this regex to match?


Solution

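  • The regex is flawed. Inside a C++ string literal, \b is not a word boundary: the compiler turns it into the backspace character (0x08) before the regex engine ever sees the pattern.

    Also, \. is not a recognized escape sequence in a string literal, so your compiler should have warned. (You need \\. and, likewise, \\b.)

    Finally, \b as a word-boundary assertion is Perl-style regex syntax, I think, so the Perl grammar is selected explicitly. Oh, and you didn't want to match only uppercase emails, right? So let's fix it: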

    boost::regex pattern("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}\\b",
        boost::regex_constants::perl | boost::regex_constants::icase);  // Email regex to match.
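
To see why the doubled backslashes matter, it can help to look at what the compiler actually stores for each spelling of the literal. The following is a small sketch using only the standard library. The function name `literal_escapes_demo` is made up for illustration, and the raw string literal (C++11) is an alternative spelling that the answer itself does not use.

```cpp
#include <cassert>
#include <string>

// Hypothetical demo: compares three spellings of the same regex text.
inline bool literal_escapes_demo() {
    // Ordinary literal, single backslash: the compiler converts \b into
    // one backspace character (0x08), so no backslash reaches the regex.
    const std::string broken = "\bword\b";
    // Ordinary literal, doubled backslash: stores a real '\' followed by 'b'.
    const std::string escaped = "\\bword\\b";
    // Raw string literal: backslashes are kept verbatim, no doubling needed.
    const std::string raw = R"(\bword\b)";

    assert(broken.size() == 6 && broken.front() == '\x08');
    assert(escaped.size() == 8 && escaped.front() == '\\');
    assert(escaped == raw);

    // The email pattern from this answer, written both ways.
    const std::string pattern_escaped =
        "\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}\\b";
    const std::string pattern_raw =
        R"(\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b)";
    return pattern_escaped == pattern_raw;
}
```

Calling `literal_escapes_demo()` from `main` should return true. Either spelling hands the same characters to the regex constructor, so choosing between them is a matter of readability.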
    

    Perhaps it would be better to use an RFC 822 parser library :)

    Here's a cleaned-up version of the code:

    #include <boost/filesystem.hpp>
    #include <boost/range/iterator_range.hpp>
    #include <boost/regex.hpp>
    #include <fstream>
    #include <iostream>
    #include <string>
    namespace fs = boost::filesystem;
    
    int main() {
        boost::regex pattern("\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}\\b",
            boost::regex_constants::perl | boost::regex_constants::icase);  // Email regex to match.
        boost::smatch result;
        std::string line;
    
        for (fs::path p : boost::make_iterator_range(fs::directory_iterator("."), {})) {
            if (!fs::is_regular_file(p) || p.extension() != ".txt")
                continue;
    
            std::cerr << "Reading " << p << "\n";
    
            std::ifstream infile(p.string()); // Read file line by line
            while (std::getline(infile, line)) {
                if (boost::regex_search(line, result, pattern)) {
                    std::cout << "\t" << result.str() << "\n";
                }
            }
        }    
    }
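
One limitation of the loop above is that regex_search stops at the first hit, so a line holding several addresses yields only one. A possible way around that is a regex iterator. The sketch below is an assumption-laden illustration rather than part of the answer: it uses std::regex (ECMAScript grammar) as a stand-in for boost::regex so that it builds without Boost, the helper name `extract_emails` is hypothetical, and the addresses in the usage note are invented.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every email-like substring found in one line.
// std::regex stands in for boost::regex here; the pattern text is the same
// as in the answer, written as a raw string literal.
inline std::vector<std::string> extract_emails(const std::string& line) {
    static const std::regex pattern(
        R"(\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b)",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> found;
    // The iterator repeats the search, each time resuming after the
    // previous match, until no further match exists in the line.
    for (std::sregex_iterator it(line.begin(), line.end(), pattern), end;
         it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

For example, `extract_emails("bob@example.org<br> CAROL@EXAMPLE.NET")` should yield both addresses, where a single regex_search call would report only the first. Boost.Regex offers boost::sregex_iterator with the same shape, so the loop should carry over with the namespace swapped.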
    

    Notes:

    • if you use Boost Filesystem, use the extension() accessor function instead of error-prone string manipulation on the path
    • reduce the nesting of conditions where possible
    • no redundant closing of files (this is C++; a std::ifstream closes itself when it goes out of scope)
    • don't loop over submatch groups, since the pattern doesn't define any
    • print the str() value of the match
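
The first note can be illustrated with a short comparison. This is only a sketch: it uses std::filesystem (C++17) as a stand-in for boost::filesystem, on the assumption that path::extension() behaves the same way for these inputs, and both function names are hypothetical.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper mirroring the question's check: compares the last
// three characters of the name (with a length guard added so that short
// names do not make the index underflow).
inline bool ends_with_txt_naive(const std::string& name) {
    return name.size() >= 3 && name.compare(name.size() - 3, 3, "txt") == 0;
}

// Hypothetical helper using the path accessor: extension() returns the
// final component's extension including the leading dot.
inline bool has_txt_extension(const std::filesystem::path& p) {
    return p.extension().string() == ".txt";
}
```

A file named `footxt` passes the naive check but has no extension at all, so `has_txt_extension("footxt")` is false, while `has_txt_extension("./input.txt")` is true.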

    On my test folder it printed (including stderr):

    Reading "./input.txt"
        [email protected]
        [email protected]
        [email protected]
    Reading "./output.txt"
    Reading "./big.txt"
    Reading "./CMakeLists.txt"
    Reading "./CMakeCache.txt"