Can a C++ program search strings as fast as, or faster than, Python?


I'm not sure why string searching runs so much faster in a program I wrote in Python than in one I wrote in C++. Is there a trick I'm missing?

Generating the test case

This test case contains a single line of interest; in the real use case I care about multiple lines.

#include <fstream>
#include <ostream>

int main() {
   // Write 50,000 numbered lines; only line 43268 ends in "care".
   std::ofstream testfile("testfile.txt");
   for (unsigned int line_idx = 0; line_idx < 50000u; line_idx++)
   {
      if (line_idx != 43268u)
      {
         testfile << line_idx << " dontcare" << std::endl;
      }
      else
      {
         testfile << line_idx << " care" << std::endl;
      }
   }
   testfile.close();
   return 0;
}
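For readers who would rather build the same file from Python, here is a minimal sketch of an equivalent generator. The function names `generate_lines` and `write_test_file` are hypothetical and not part of the original post; the line count and the index of the "care" line are taken from the C++ program above.

```python
def generate_lines(total=50000, care_idx=43268):
    """Yield the test-file lines: '<n> dontcare', except '<n> care' at care_idx."""
    for line_idx in range(total):
        suffix = "care" if line_idx == care_idx else "dontcare"
        yield f"{line_idx} {suffix}"


def write_test_file(path="testfile.txt", total=50000, care_idx=43268):
    """Write the generated lines to path, one per line."""
    with open(path, "w") as testfile:
        for line in generate_lines(total, care_idx):
            testfile.write(line + "\n")
```

Calling `write_test_file()` with its defaults should produce a `testfile.txt` in the current directory with the same layout as the C++ generator.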

The regular expression used is ^(\d*)\s(care)$
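To make the pattern concrete, the sketch below applies it to individual lines with Python's `re` module. The helper name `parse_line` is hypothetical and only serves the illustration.

```python
import re

# Same pattern as in the post: zero or more digits, one whitespace
# character, then the literal word "care" at the end of the line.
PTRN = re.compile(r'^(\d*)\s(care)$')


def parse_line(line):
    """Return (digits, word) if the whole line matches the pattern, else None."""
    m = PTRN.search(line)
    return m.groups() if m else None
```

With this helper, `parse_line("43268 care")` returns the two captured groups, while `parse_line("43267 dontcare")` returns `None`. Because `\d*` also accepts zero digits, a line such as `" care"` matches too, with an empty first group.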

The C++ Program takes 13.954 seconds

#include <ctime>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main() {
   double duration;
   std::clock_t start;
   ifstream testfile("testfile.txt", ios_base::in);
   bool found = false;
   string line;
   regex ptrn("^(\\d*)\\s(care)$");

   start = std::clock();   /* Debug time */
   // Apply the regex to each line of the file in turn.
   while (getline(testfile, line))
   {
      std::smatch matches;
      if(regex_search(line, matches, ptrn))
      {
         found = true;
      }
   }
   testfile.close();
   duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
   std::cout << "Found? " << (found ? "yes" : "no") << std::endl;
   std::cout << " Total time: " <<  duration << std::endl;
   return 0;
}

Python Program takes 0.02200 seconds

import re            # to search file
import time          # to benchmark

ptrn  = re.compile(r'^(\d*)\s(care)$', re.MULTILINE)

start = time.time()
with open('testfile.txt','r') as testfile:
   filetext = testfile.read()
   matches = ptrn.findall(filetext)
   # Parenthesize the conditional; otherwise it applies to the whole concatenation.
   print("Found? " + ("Yes" if len(matches) == 1 else "No"))

end = time.time()
print("Total time", end - start)
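Note that the two programs do not do quite the same work: the C++ version applies the regex to one line at a time, while the Python version reads the whole file and scans it once with `re.MULTILINE`. A line-by-line Python variant, closer to the C++ loop, might look like the sketch below. The function name `contains_care_line` is hypothetical, and no timing is claimed for it.

```python
import re

# No re.MULTILINE here: each line is matched on its own.
PTRN = re.compile(r'^(\d*)\s(care)$')


def contains_care_line(lines):
    """Return True if any line matches the pattern, checking one line at a time."""
    found = False
    for line in lines:
        # A file object yields lines with their trailing newline; strip it first.
        if PTRN.search(line.rstrip("\n")):
            found = True
    return found
```

Since an open file iterates over its lines, passing the file object from `open('testfile.txt')` to `contains_care_line` should mirror the `getline` loop in the C++ program.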

Solution

  • Implemented Ratah's recommendation of reading the file into a single string, which brought the time down to 8.923 seconds, an improvement of about 5 seconds.

       double duration;
       std::clock_t start;
       ifstream testfile("testfile.txt", ios_base::in);
       unsigned int line_idx = 0;
       bool found = false;
       string line;
       regex ptrn("^(\\d*)\\s(care)$");
       std::smatch matches;
    
       start = std::clock();   /* Debug time */
       std::string test_str((std::istreambuf_iterator<char>(testfile)),
                     std::istreambuf_iterator<char>());
    
       if(regex_search(test_str, matches, ptrn))
       {
          found = true;
       }
       testfile.close();
       duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
       std::cout << "Found? " << (found ? "yes" : "no") << std::endl;
       std::cout << " Total time: " <<  duration << std::endl;
    

    After UKMonkey's note, I switched the project to the Release configuration, which also enables /O2 optimization, and that brought the time down to 0.086 seconds.

    Thanks to Jean-Francois Fabre, Ratah, UKMonkey