Context of the problem: I have a large binary file containing data with a unique structure. A unit of this data is called an "event". Each event is 32016 bytes, and a single file holds about 400000 events, making the file ~12 GB. I'm writing a program to process the events and am trying a multithreaded approach, with several threads reading different segments of the file (each thread using its own file stream).
The problem is that fseek fails to seek to the correct position in the file. The following is a minimal reproducible example. The program reads a binary file with 473797 events and plans to use 20 threads, each with its own file stream.
#include <iostream>
#include <cstdio>
#include <errno.h>
#include <string.h>
using namespace std;
int main(){
FILE *segment[20];
int ret=0;
int eventsPerThread=473797/20;
int eventSize=8004;
for(int k=0;k<20;++k){
segment[k]=fopen("Datafile_367.bin","rb");
if(segment[k]==NULL){
std::cout<<"file stream is NULL!"<<k<<"\n";
}
ret=fseek(segment[k],eventsPerThread*eventSize*4*k,SEEK_SET);
std::cout<<ret<<":::"<<strerror(errno)<<"\n";
}
return 0;
}
The following is the output. fseek sometimes succeeds and returns 0, and at other times fails with error code 22 (Invalid argument).
0:::Success
0:::Success
0:::Success
-1:::Invalid argument
-1:::Invalid argument
-1:::Invalid argument
0:::Invalid argument
0:::Invalid argument
0:::Invalid argument
-1:::Invalid argument
-1:::Invalid argument
-1:::Invalid argument
0:::Invalid argument
0:::Invalid argument
0:::Invalid argument
-1:::Invalid argument
-1:::Invalid argument
0:::Invalid argument
0:::Invalid argument
0:::Invalid argument
Any explanations for this behavior of the fseek() function?
(Note that the minimal reproducible example is single-threaded. Multithreading will happen once the program starts to read the events.)
The error is the overflow in your offset calculation. You use int, which is apparently 4 bytes wide. INT_MAX is 2147483647 for this width.
Let's see:
k | eventsPerThread * eventSize * 4 * k | overflowed int | return value of fseek() |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 758427024 | 758427024 | 0 |
2 | 1516854048 | 1516854048 | 0 |
3 | 2275281072 | -2019686224 | -1 |
4 | 3033708096 | -1261259200 | -1 |
5 | 3792135120 | -502832176 | -1 |
6 | 4550562144 | 255594848 | 0 |
7 | 5308989168 | 1014021872 | 0 |
: | : | : | : |
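The resulting int becomes negative because of the overflow, and fseek() rejects a negative offset from SEEK_SET with "Invalid argument". Note that a return value of 0 for k = 6 and above does not mean the seek went to the right place: the overflowed value merely wrapped back to a positive, but wrong, offset.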
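The wrap-around in the table can be reproduced with a small sketch. The function names exactOffset, wrappedOffset and printWrapTable are made up for this illustration. The exact product is computed in 64-bit arithmetic and then truncated to 32 bits, which mimics what a 4-byte int typically ends up holding (strictly speaking, signed overflow is undefined behaviour in C++, so the original program is not guaranteed to wrap this way).

```cpp
#include <cstdint>
#include <iostream>

// Exact product, computed in 64-bit arithmetic so it cannot overflow here.
std::int64_t exactOffset(int eventsPerThread, int eventSize, int k) {
    return static_cast<std::int64_t>(eventsPerThread) * eventSize * 4 * k;
}

// The value after truncation to 32 bits. The conversion to uint32_t is
// modular; the final conversion to int32_t wraps on two's complement
// platforms (guaranteed since C++20, implementation-defined before).
std::int32_t wrappedOffset(std::int64_t exact) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(exact));
}

// Prints k, the exact offset, and the wrapped 32-bit value for each segment.
void printWrapTable(int eventsPerThread, int eventSize, int segments) {
    for (int k = 0; k < segments; ++k) {
        const std::int64_t exact = exactOffset(eventsPerThread, eventSize, k);
        std::cout << k << " | " << exact << " | " << wrappedOffset(exact) << "\n";
    }
}
```

Calling printWrapTable(473797 / 20, 8004, 20) lists all 20 segments; the negative entries line up with the -1 results in the question's output.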
First, make sure your longs are more than 4 bytes wide. Then change at least one operand of your multiplication to long. For example, like this: eventsPerThread * eventSize * 4L * k.
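A minimal sketch of that fix, assuming long is 8 bytes wide (as is typical on 64-bit Linux). The helper names segmentOffset and openSegment are hypothetical and not part of the original program. The static_assert makes the build fail on a platform where long is too narrow, instead of silently overflowing again; on such platforms a 64-bit seek function such as POSIX fseeko or MSVC's _fseeki64 would be the alternative.

```cpp
#include <cstdio>

// fseek takes a long offset; this sketch assumes long is at least 8 bytes.
static_assert(sizeof(long) >= 8,
              "long is too narrow for offsets into a ~12 GB file");

// Hypothetical helper: byte offset at which segment k starts.
// All operands are long, so the multiplication is done in 64 bits.
long segmentOffset(long eventsPerThread, long eventSize, long k) {
    return eventsPerThread * eventSize * 4L * k;
}

// Hypothetical helper: opens the file and seeks to the start of segment k.
// Returns nullptr if the file cannot be opened or the seek fails.
std::FILE* openSegment(const char* path, long eventsPerThread,
                       long eventSize, long k) {
    std::FILE* f = std::fopen(path, "rb");
    if (f == nullptr) {
        return nullptr;
    }
    if (std::fseek(f, segmentOffset(eventsPerThread, eventSize, k), SEEK_SET) != 0) {
        std::fclose(f);
        return nullptr;
    }
    return f;
}
```

In the question's loop this would be used as segment[k] = openSegment("Datafile_367.bin", eventsPerThread, eventSize, k), with a nullptr check afterwards.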
Final note: Consider using more whitespace to make your code more readable.