I am processing Common Crawl WARC files. Each is 5 GB uncompressed and contains text, XML and WARC headers.
This is the line I am particularly having trouble with:
wstring sub = buffer->substr(windowStart, windowSize);
This gives me the error "expression must have a pointer to class type". I take it that this is because buffer is a pointer to a heap memory block of that size, so I cannot run any string operations on it. But shouldn't the -> operator get the contents that it points to, so that I can run something like substr?
I am using a simple buffer like this because I understand that mapping the file to memory (MapViewOfFile, etc.) is more for random access. Is it actually slower if all I need is a sequential read?
I would like to read the file sequentially. To improve speed, I want to read the file into RAM in chunks, say 1 MB per chunk, and process each chunk in RAM before getting the next one from disk.
I am not processing all the XML; some will be skipped. I am grabbing the text and some of the WARC headers and skipping the rest.
The idea is to use a sliding window through the file chunk in RAM. The window starts where it last left off in the chunk and grows in size in a loop. Once it reaches a sufficient size, a regex checks whether it contains any matching tags, headers or text. If it does, the code either skips just that tag, skips ahead a fixed number of characters (500 in some cases, if it comes across a particular type of WARC header), or writes the tag out (if it is one I want to keep), etc.
When the window matches, windowStart is set equal to windowEnd and the window starts expanding again to find the next pattern. Once the buffer ends, the code keeps track of any partial tag and refills the buffer from disk.
The main problem I am running into is how to do the whole sliding window. The buffer is a pointer to a location in heap memory, and I can't use the . or -> operators on it for some reason, so I can't use substr, regex, etc. I could make a copy, but do I really need to do that?
Here's my code so far:
BOOL pageActive = FALSE;
BOOL xml = FALSE;
#define MAXBUFFERSIZE 1024
#define MAXTAGSIZE 64
DWORD windowStart = 0; DWORD windowEnd = 15; DWORD windowSize = 15; // buffer window containing tag candidate
wstring windowCopy;
DWORD bufferSize = MAXBUFFERSIZE;
_int64 fileRemaining;
HANDLE hFile;
DWORD dwBytesRead = 0;
OVERLAPPED ol = { 0 };
LARGE_INTEGER dwPosition;
TCHAR* buffer;
hFile = CreateFile(
inputFilePath, // file to open
GENERIC_READ, // open for reading
FILE_SHARE_READ | FILE_SHARE_WRITE, // share for reading and writing
NULL, // default security
OPEN_EXISTING, // existing file only
FILE_ATTRIBUTE_NORMAL, // normal file | FILE_FLAG_OVERLAPPED
NULL); // no attr. template
if (hFile == INVALID_HANDLE_VALUE)
{
DisplayErrorBox((LPWSTR)L"CreateFile");
return 0;
}
LARGE_INTEGER size;
GetFileSizeEx(hFile, &size);
_int64 fileSize = (__int64)size.QuadPart;
double gigabytes = fileSize * 9.3132e-10;
sendToReportWindow(L"file size: %lld bytes (%.1f gigabytes)\n", fileSize, gigabytes);
if(fileSize > MAXBUFFERSIZE)
{
TCHAR* buffer = new TCHAR[MAXBUFFERSIZE]; buffer[0] = 0;
//sendToReportWindow(L"buffer is MAXBUFFERSIZE\n");
}
else
{
TCHAR* buffer = new TCHAR[fileSize]; buffer[0] = 0;
//sendToReportWindow(L"buffer is fileSize + 1\n");
}
fileRemaining = fileSize;
sendToReportWindow(L"file remaining: %lld bytes\n", fileRemaining);
//TCHAR readBuffer[MAXBUFFERSIZE] = { 0 };
while (fileRemaining) // outer loop. while file remaining, read file chunk to buffer
{
if (bufferSize > fileRemaining) // as fileremaining gets smaller as file is processed, it eventually is smaller than the buffer
bufferSize = fileRemaining;
if (FALSE == ReadFile(hFile, buffer, bufferSize -1, &dwBytesRead, NULL))
//if (FALSE == ReadFile(hFile, readBuffer, bufferSize -1, &dwBytesRead, NULL))
{
sendToReportWindow(L"file read failed\n");
CloseHandle(hFile);
return 0;
}
fileRemaining -= bufferSize; //fileRemaining is size of the file left after this buffer is processed
sendToReportWindow(L"outer loop\n");
// declare and clear span char array[maxTagSize] // size of array is maximum tag size (64). This is for unused windows. Raw text is not considered a tag
while (windowEnd < bufferSize) //inner loop. while unused data remains in buffer
{
windowSize = windowEnd - windowStart;
// windowsize += span.size
// The window start position remains fixed as the window size is slowly increased. Once it is large enough, some conditional below begin to look at it.If any triggers, they eat that window. Setting the new start position at the previous end position.
// If the buffer ends mid - tag, the contents of the window are copy to the span array variable
// Page state. Tags in header
// If !pageActive
// if windowSize > 7 (warc / 1.0)
// Convert chunk to string for regex ? (prepend span array from previous loop)
// If Regex chunk WARC - Type : response pageActive = true; wstart = wend, clear span
// Elseif regex chunk other warc - type clear span; skip ahead 550 for start, 565 for end
// Continue
// // page is active
//
// if windowSize > 6
// If regex chunk WARC / \d pageActive = false; xml = false; wstart = wend, clear span; Continue
// If !xml
// If windowSize > 15 (warc date)
// Convert chunk to string for regex ? (prepend span array from previous loop)
// If regex chunk warc date output warc date; wstart = wend, clear span
// elseIf regex chunk warc uri output warc uri; wstart = wend, clear span; skip ahead 300
// ElseIf end of window has \n“ < ” Xml = true // any window size where xml is not started
// continue // whatever triggers in this !xml block, always continue
// // page and xml are active
// // only send to output bare text when a [^\n]< or newline is reached
// test where just outputs all the tags or text it finds
// pull out any <.+> sequences or any >.+< sequences
// multibyte conversion, build string of window
//LPCCH readBuffer = { "ab" }; // = buffer[2];
// std::string str2 = str.substr (3,5);
//wstring sub = (wstring)readBuffer.substr(0,5); // substring of buffer
wstring sub = buffer->substr(windowStart, windowSize);
TCHAR converted[64] = { 0 };
MultiByteToWideChar(CP_ACP, MB_COMPOSITE, (LPCCH)&sub, -1, converted, MAXBUFFERSIZE);
//MultiByteToWideChar(CP_ACP, MB_COMPOSITE, (LPCCH)buffer, MAXBUFFERSIZE, converted, 1); // convert between the utf encoding of the file to the utf encoding of windows?
sendToReportWindow(L"windowStart:%d windowEnd:%d char:%s\n", windowStart, windowEnd, converted);
//sendToReportWindow((LPWSTR)buffer[windowStart]);
windowStart = windowEnd;
// //Tags in body. Any chunk size
// Convert chunk to string for regex ? (prepend span array from previous loop)
// if regex chunk tag pattern output pattern, wstart = wend, clear span
// nested tags? no
// windowEnd++; // tests above did not bite. so increment end of window, increasing window size
} // inner loop: while windowEnd <buffersize
// end of buffer: load any unused window into span
//If windowEnd != windowStart // window start did not get set to end by regex above
//Span = buffer(start – end)
//file progress indicator
//fileSize / fileRemaining x 0.01 // calculate percentage of file remaining with each buffer load
//print progress
//windowStart = 0; windowEnd = 1; windowSize = 1 // look at smaller pieces after first iteration (not in w header)
} // outer loop. while fileRemaining
delete[] buffer; // memory from new[] must be released with delete[]
Which give me the error, "expression must have a pointer to class type".
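TCHAR has no such method as substr. buffer is a TCHAR*, a pointer to a plain character type rather than to a class object, so -> cannot be used to call member functions on it.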
Change it to the following (this copies the buffer into a wstring, so the buffer must be null-terminated):
wstring str(buffer);
wstring sub = str.substr(windowStart, windowSize);
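If copying the buffer into a wstring for every window is a concern, a std::wstring_view (C++17) can describe a window of the buffer without copying it. This is only a minimal sketch, assuming the buffer holds wchar_t data and its valid length is known; window_of is a hypothetical helper name, not something from the original code:

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns a non-owning view of up to `count` characters
// of `buffer`, starting at `start`. No characters are copied.
std::wstring_view window_of(const wchar_t* buffer, std::size_t length,
                            std::size_t start, std::size_t count)
{
    std::wstring_view whole(buffer, length);  // explicit length: no terminator needed
    if (start > length) start = length;       // substr would throw if start > length
    return whole.substr(start, count);        // substr clamps count to what is left
}
```

The view is only valid while the buffer is alive and unchanged, so anything that must survive a refill of the buffer still has to be copied. std::regex_search also accepts an iterator pair, so view.begin() and view.end() can be passed to it together with a std::wregex.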
Other code that needs to be modified:
Replace these two lines:
MultiByteToWideChar(CP_ACP, MB_COMPOSITE, (LPCCH)&sub, -1, converted, MAXBUFFERSIZE);
sendToReportWindow(L"windowStart:%d windowEnd:%d char:%s\n", windowStart, windowEnd, converted);
with:
sendToReportWindow(L"windowStart:%d windowEnd:%d char:%s\n", windowStart, windowEnd, sub.c_str()); // use the string::c_str method
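In the if/else blocks, assign to the existing buffer instead of declaring a new one. The inner TCHAR* buffer shadows the outer variable, which is then still uninitialized when ReadFile uses it:
buffer = new TCHAR[MAXBUFFERSIZE]; buffer[0] = 0; // remove TCHAR*
buffer = new TCHAR[fileSize]; buffer[0] = 0; // remove TCHAR*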
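The if/else can also be avoided altogether by computing the size once and letting a container own the memory. A minimal sketch, assuming a std::vector<char> is acceptable as the chunk buffer; make_buffer is a hypothetical name used only for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: one chunk buffer, sized to the smaller of the file
// size and the chunk limit. std::vector releases its storage automatically,
// so no matching delete[] is needed.
std::vector<char> make_buffer(std::uint64_t fileSize, std::size_t maxBufferSize)
{
    const std::uint64_t wanted = std::min<std::uint64_t>(fileSize, maxBufferSize);
    return std::vector<char>(static_cast<std::size_t>(wanted));
}
```

With this, buffer.data() and buffer.size() would be what gets passed to ReadFile, and there is only one variable named buffer, so nothing can shadow it.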
I am not processing all the xml, some will be skipped. grabbing the text and some of the warc headers, skipping the rest.
You can use string::find to grab the WARC header. (Make sure the WARC header string you search for is unique.)
See, e.g.: Check if a string contains a string in C++
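As a rough illustration of the string::find approach, the sketch below pulls the value of one named header out of a chunk. It assumes the chunk is held as a std::string of bytes and that header lines end in CRLF, as they do in the WARC format; header_value is a hypothetical helper, not an existing API:

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper: finds "<name>: " in `chunk` and returns the rest of
// that line. Returns std::nullopt if the header is absent or the line is cut
// off at the end of the chunk (no CRLF yet).
std::optional<std::string> header_value(const std::string& chunk,
                                        const std::string& name)
{
    const std::string key = name + ": ";
    const std::size_t start = chunk.find(key);
    if (start == std::string::npos) return std::nullopt;

    const std::size_t valueStart = start + key.size();
    const std::size_t lineEnd = chunk.find("\r\n", valueStart);
    if (lineEnd == std::string::npos) return std::nullopt;  // partial line

    return chunk.substr(valueStart, lineEnd - valueStart);
}
```

This returns the first occurrence only, so it can be fooled if the same text also appears in a page body, which is why the header string needs to be unique. A std::nullopt result for a cut-off line is the signal to carry the tail over and retry after the next chunk is read.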
By the way, whether you use Unicode (wide) characters or multi-byte characters, you need to stick to a single encoding throughout.