Tags: c++, performance, buffer, ifstream, large-files

Retrieving File Data Stored in Buffer


I'm new to the forum, but not to this website. I've been searching for weeks for a way to process a large data file quickly using C++11. I'm trying to write a function that takes the trace file name, opens the file, and processes the data. The trace file contains 2 million lines, and each line consists of a read/write operation and a hex address:

r abcdef123456

With that much data, I need to read and parse those two values quickly. My first attempt at reading the file was the following:

void getTraceData(string filename)
{
  ifstream inputfile;
  string file_str;
  vector<string> op, addr;

  // Open input file
  inputfile.open(filename.c_str());
  cout << "Opening file for reading: " << filename << endl;

  // Determine if file opened successfully
  if(inputfile.fail())
  {
    cout << "Text file failed to open." << endl;
    cout << "Please check file name and path." << endl;
    exit(1);
  }

  // Retrieve and store address values and operations
  if(inputfile.is_open())
  {
    cout << "Text file opened successfully." << endl;

    while(inputfile >> file_str)
    {
      if((file_str == "r") || (file_str == "w"))
      {
        op.push_back(file_str);
      }
      else
      {
        addr.push_back(file_str);
      }
    }
  }
  inputfile.close();
  cout << "File closed." << endl;
}

It worked, but the program took 8 minutes to read the file. I then changed it to read the whole file into a buffer with ifstream::read, which took a fraction of a second instead of 8 minutes:

void getTraceData()
{
	// Setup variables
	char* fbuffer;
	ifstream ifs("text.txt");
	long int length;
	clock_t start, end;

	// Start timer + get file length
	start = clock();
	ifs.seekg(0, ifs.end);
	length = ifs.tellg();
	ifs.seekg(0, ifs.beg);

	// Setup buffer to read & store file data
	fbuffer = new char[length];
	ifs.read(fbuffer, length);
	ifs.close();
	end = clock();

	float diff((float)end - (float)start);
	float seconds = diff / CLOCKS_PER_SEC;

	cout << "Run time: " << seconds << " seconds" << endl;

	delete[] fbuffer;
}

But when I added code to parse the buffer contents line by line and store the two values in separate variables, the program silently exits at the while loop that calls getline on the buffer:

void getTraceData(string filename)
{
	// Setup variables
	char* fbuffer;
	ifstream ifs("text.txt");
	long int length;
	string op, addr, line;
	clock_t start, end;

	// Start timer + get file length
	start = clock();
	ifs.seekg(0, ifs.end);
	length = ifs.tellg();
	ifs.seekg(0, ifs.beg);

	// Setup buffer to read & store file data
	fbuffer = new char[length];
	ifs.read(fbuffer, length);
	ifs.close();

	// Setup stream buffer
	const int maxline = 20;
	char* lbuffer;
	stringstream ss;

	// Parse buffer data line-by-line
	while(ss.getline(lbuffer, length))
	{
		while(getline(ss, line))
		{
			ss >> op >> addr;
		}
		ss.ignore( strlen(lbuffer));
	}
	end = clock();

	float diff((float)end - (float)start);
	float seconds = diff / CLOCKS_PER_SEC;

	cout << "Run time: " << seconds << " seconds" << endl;

	delete[] fbuffer;
	delete[] lbuffer;  
}

Once the file is read into a buffer, how do I retrieve the data and store it in variables? For context, my target is under 2 minutes to read and process the data file. Right now I'm focused only on reading the input file, not on the rest of my program or the machine it runs on (the code is portable to other machines). The language is C++11 and the OS is Linux. Sorry for the long post.


Solution

  • Your stringstream ss is not associated with fbuffer at all. You are calling getline on an empty stringstream, so nothing is ever read. Try this:

    // fbuffer is not null-terminated, so pass its length explicitly
    string inputedString(fbuffer, length);
    istringstream ss(inputedString);
    

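    Also, allocate memory for lbuffer before calling ss.getline(lbuffer, length). As written, lbuffer is an uninitialized pointer, so both writing through it and the final delete[] lbuffer are undefined behavior.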
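
A minimal sketch of how the buffer could be parsed once it is attached to a stream, assuming every line has the form "op address" as in the question. The helper name parseTrace and the pair-based record type are illustrative choices, not part of the original code. Extracting with operator>> skips whitespace and newlines, so no separate line buffer is needed here:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: split an in-memory trace buffer into (op, addr) pairs.
// The buffer is not assumed to be null-terminated, so its length is passed in.
std::vector<std::pair<char, std::string> > parseTrace(const char* data, std::size_t length)
{
    std::vector<std::pair<char, std::string> > records;

    // Copy the raw bytes into a string and attach a stream to it.
    const std::string text(data, length);
    std::istringstream ss(text);

    char op;           // 'r' or 'w'
    std::string addr;  // hex address, kept as text

    // Each extraction skips leading whitespace, including newlines,
    // and the loop stops at end of input or on a malformed record.
    while (ss >> op >> addr)
    {
        records.push_back(std::make_pair(op, addr));
    }
    return records;
}
```

With the question's variables, the call would look like parseTrace(fbuffer, length), made before delete[] fbuffer.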

    Alternatively, you can read the file directly into a string, which avoids the extra copy. See "Reading directly from an std::istream into an std::string".
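
One way this could look, as a sketch rather than the method from the linked question: size a std::string to the file length and read straight into its storage. The function name readWholeFile is hypothetical, and the code relies on std::string storage being contiguous, which C++11 guarantees:

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: load an entire file into a std::string with one read.
std::string readWholeFile(const std::string& path)
{
    std::ifstream ifs(path.c_str(), std::ios::in | std::ios::binary);
    if (!ifs)
        throw std::runtime_error("cannot open " + path);

    // Determine the file size by seeking to the end.
    ifs.seekg(0, std::ios::end);
    const std::streamoff size = ifs.tellg();
    if (size < 0)
        throw std::runtime_error("cannot determine size of " + path);
    ifs.seekg(0, std::ios::beg);

    // Allocate the string once, then read directly into its storage.
    std::string contents(static_cast<std::size_t>(size), '\0');
    ifs.read(&contents[0], size);

    // Keep only the bytes that were actually read.
    contents.resize(static_cast<std::size_t>(ifs.gcount()));
    return contents;
}
```

The returned string can then be passed to an istringstream, and there is no new[]/delete[] pair to manage.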

    Last but not least, since your vectors are quite large, reserve enough space up front instead of growing them one push_back at a time. When a vector reaches its capacity, the next push_back reallocates and copies (or moves) all existing items to keep the storage contiguous. With millions of items, that happens quite a few times.
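
A rough sketch of the reserve advice, assuming the whole file is already in a string and that counting newline characters is an acceptable estimate of the number of records. The TraceData struct and parseWithReserve are hypothetical names introduced only for this example:

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container mirroring the two vectors in the question.
struct TraceData
{
    std::vector<char> ops;            // 'r' or 'w'
    std::vector<std::string> addrs;   // hex addresses as text
};

TraceData parseWithReserve(const std::string& text)
{
    TraceData out;

    // One record per line, plus one in case the last line has no newline.
    const std::size_t lines =
        static_cast<std::size_t>(std::count(text.begin(), text.end(), '\n')) + 1;

    // Reserve once so push_back never has to reallocate while parsing.
    out.ops.reserve(lines);
    out.addrs.reserve(lines);

    std::istringstream ss(text);
    char op;
    std::string addr;
    while (ss >> op >> addr)
    {
        out.ops.push_back(op);
        out.addrs.push_back(addr);
    }
    return out;
}
```

Counting newlines costs one extra pass over the data, which is usually cheap compared with repeated reallocation of multi-million-element vectors.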