I'm trying to read a huge dataset of 20 million lines; each line holds a huge number (I'm storing the numbers in unsigned long long variables), for example: 1774251443, 8453058335, 19672843924, and so on.

I wrote a simple function to do this, shown below:
void read(char pathToDataset[], void **arrayToFill, int arrayLength) {
    FILE *dataset = fopen(pathToDataset, "r");
    if (dataset == NULL) {
        printf("Error while opening the file.\n");
        exit(EXIT_FAILURE); // exit with failure, it closes the program
    }

    int i = 0;
    /* Prof. suggestion: do a malloc RIGHT HERE, to allocate
     * space in memory in which to store the element
     * to insert in the array
     */
    while (i < arrayLength && fscanf(dataset, "%llu", (unsigned long long *)&arrayToFill[i]) != EOF) {
        // ONLY FOR DEBUG, it would print 20M lines!
        //printf("line: %d.\n", i);
        /* Prof. suggestion: do another malloc here,
         * for each element to be read.
         */
        i++;
    }
    printf("Read %d lines", i);
    fclose(dataset);
}
The parameter arrayToFill is of type void** because of the exercise goal: every function has to work on a generic type, and the array could potentially be filled with any type of data (in this example, huge numbers, but it could just as well contain huge strings, plain integers, and so on).

I don't understand why I have to do two malloc calls; isn't a single one enough?
For your first question, think of malloc as a request for memory to store N objects, each of size S. With the parameters void **arrayToFill, int arrayLength, you are saying this array will hold arrayLength pointers, each of size sizeof(void *). That is the first allocation and the first call to malloc.
But the members of that array are pointers, which are themselves meant to refer to memory holding some other object. The first call to malloc only allocates memory to store the void * of each array member; the storage each individual member points to needs its own malloc() call.
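Here is a minimal sketch of that two-level allocation (the count of 4 and the placeholder values are just for illustration, not the exercise's 20 million lines read from a file):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int arrayLength = 4;                  /* small count just for illustration */

    /* First malloc: arrayLength pointers, each of size sizeof(void *). */
    void **arrayToFill = malloc(arrayLength * sizeof(void *));
    if (arrayToFill == NULL)
        return EXIT_FAILURE;

    for (int i = 0; i < arrayLength; i++) {
        /* Second malloc (one per element): storage for one unsigned long long. */
        unsigned long long *value = malloc(sizeof *value);
        if (value == NULL)
            return EXIT_FAILURE;
        *value = 1774251443ULL + i;       /* stand-in for a number read from the file */
        arrayToFill[i] = value;
    }

    for (int i = 0; i < arrayLength; i++)
        printf("%llu\n", *(unsigned long long *)arrayToFill[i]);

    /* Free in reverse order: each element first, then the pointer array. */
    for (int i = 0; i < arrayLength; i++)
        free(arrayToFill[i]);
    free(arrayToFill);
    return EXIT_SUCCESS;
}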
Efficient Line Reading
For your other question: making lots of small allocations and later freeing them (assuming you do free them, otherwise you would leak a lot of memory) is very slow. However, for I/O-related work the performance hit depends more on the number of calls you make than on the amount of memory you allocate per call.
Have your program read the entire file into memory in one go, and allocate a single array of unsigned long long for 20 million entries, or however many integers you expect to handle. Then you can walk through the file contents with strtoull from <stdlib.h> (the unsigned long long counterpart of strtol; values such as 8453058335 can exceed the range of a 32-bit long) and copy each parsed value into your large array, one by one.

This way, you only need two or three large memory allocations and deallocations.
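A rough sketch of that approach might look like the following. The function name read_all and the count parameter are made up for illustration, and real code would want more error checking around fseek/ftell and the parse loop:

#include <stdio.h>
#include <stdlib.h>

/* Sketch under these assumptions: the file holds one unsigned number per line,
 * and "count" is an upper bound on how many numbers it contains (e.g. 20 million). */
unsigned long long *read_all(const char *path, size_t count, size_t *readCount) {
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return NULL;

    /* One large allocation for the whole file ... */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    char *buf = malloc((size_t)size + 1);
    /* ... and one for all the numbers. */
    unsigned long long *values = malloc(count * sizeof *values);
    if (buf == NULL || values == NULL) {
        free(buf); free(values); fclose(f);
        return NULL;
    }

    size_t got = fread(buf, 1, (size_t)size, f);
    buf[got] = '\0';
    fclose(f);

    /* Walk the buffer with strtoull; "end" advances past each parsed number,
     * and strtoull itself skips the whitespace (newlines) between numbers. */
    size_t n = 0;
    char *p = buf, *end;
    while (n < count) {
        unsigned long long v = strtoull(p, &end, 10);
        if (end == p)            /* no more digits to parse */
            break;
        values[n++] = v;
        p = end;
    }

    free(buf);
    *readCount = n;
    return values;
}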