I wrote a simple program that reads a very large file into memory. (The file is around 480 megabytes in size.) The file contains comma-separated values of 0s and 1s. The code is fairly straightforward: I first get the file size, then allocate enough buffer space, read the file, split it on commas, and put the values in the array. The program is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(){
long no_of_houses = 1048576L; //dimensions of my final table.
int no_of_appliances = 5;
int no_of_sectors = 48;
int* intended_schedule; // this is where the table will be stored.
intended_schedule = (int*) malloc(no_of_houses * no_of_appliances * no_of_sectors * sizeof(int));
FILE* fptr = fopen("./data/houses.csv", "r"); //this file is around 480 mega bytes.
if(fptr == NULL){
perror("housese file");
exit(0);
}
fseek(fptr, 0L, SEEK_END); //find the size of the file before allocating space
long size = ftell(fptr);
rewind(fptr);
char* buffer = (char*) calloc(1, size); //now we know the size, we can allocate space.
fread(buffer, size, 1, fptr);
char* token = strtok(buffer, ",\n"); //it's a comma separated file. So break from comma
long no = 0;
while(token != NULL){
if(no == no_of_houses*no_of_appliances*no_of_sectors)
break; //guard against unexpectedly big data file.
intended_schedule[no] = token[0] - 48;// it's either 0 or 1. So this is good enough
no++;
token = strtok(NULL, ",\n");
}
fclose(fptr);
free(intended_schedule);
free(buffer);
return 0;
}
I used this code as a function of a bigger program and since it gave me errors, I ran this program through valgrind. This is the result I got:
[goodman@node2 analyse_code]$ valgrind ./analyse
==39263== Memcheck, a memory error detector
==39263== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==39263== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==39263== Command: ./analyse
==39263==
==39263== Warning: set address range perms: large range [0x51f8040, 0x411f8040) (undefined)
==39263== Warning: set address range perms: large range [0x59e3f040, 0x77e3f040) (defined)
==39263== Warning: set address range perms: large range [0x59e3f040, 0x77e3f040) (defined)
==39263== Invalid read of size 1
==39263== at 0x4EBEDCC: strtok (in /usr/lib64/libc-2.17.so)
==39263== by 0x400997: main (analyse.c:36)
==39263== Address 0x77e3f040 is 0 bytes after a block of size 503,316,480 alloc'd
==39263== at 0x4C2B9B5: calloc (vg_replace_malloc.c:711)
==39263== by 0x400904: main (analyse.c:27)
==39263==
==39263== Invalid read of size 1
==39263== at 0x4EBEDFC: strtok (in /usr/lib64/libc-2.17.so)
==39263== by 0x400997: main (analyse.c:36)
==39263== Address 0x77e3f040 is 0 bytes after a block of size 503,316,480 alloc'd
==39263== at 0x4C2B9B5: calloc (vg_replace_malloc.c:711)
==39263== by 0x400904: main (analyse.c:27)
==39263==
==39263== Warning: set address range perms: large range [0x51f8028, 0x411f8058) (noaccess)
==39263== Warning: set address range perms: large range [0x59e3f028, 0x77e3f058) (noaccess)
==39263==
==39263== HEAP SUMMARY:
==39263== in use at exit: 0 bytes in 0 blocks
==39263== total heap usage: 3 allocs, 3 frees, 1,509,950,008 bytes allocated
==39263==
==39263== All heap blocks were freed -- no leaks are possible
==39263==
==39263== For counts of detected and suppressed errors, rerun with: -v
==39263== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
I'm wondering why I get these errors. As far as I can tell, there are no problems with my code. Is it because my data is too large? I don't think that could be the case since I run this code on a server with 128 GB of RAM.
Any help would be appreciated.
--ppgoodman
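strtok() assumes a NUL-terminated string. Your buffer is NOT NUL-terminated: it is exactly as large as the file, and fread() fills all of it, so strtok() walks past the end of the buffer looking for the terminator. That is the one-byte read "0 bytes after a block of size 503,316,480" that valgrind reports. But you can do without strtok() and the large buffer.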
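If you would rather keep the read-everything-then-strtok() approach, the smallest fix is probably to allocate one extra byte and terminate the buffer yourself. A rough sketch follows; the helper names read_all_terminated and parse_bits are made up for illustration and are not part of the original program:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read the whole file into a malloc'd buffer and
 * append a terminating NUL so that string functions can be used on it.
 * Returns NULL on failure; the caller frees the buffer. */
char *read_all_terminated(FILE *fp, size_t *len_out)
{
    if (fseek(fp, 0L, SEEK_END) != 0)
        return NULL;
    long size = ftell(fp);
    if (size < 0)
        return NULL;
    rewind(fp);

    char *buf = malloc((size_t)size + 1);   /* one extra byte for '\0' */
    if (buf == NULL)
        return NULL;

    /* Terminate at the number of bytes actually read, not at 'size'. */
    size_t got = fread(buf, 1, (size_t)size, fp);
    buf[got] = '\0';
    if (len_out != NULL)
        *len_out = got;
    return buf;
}

/* Hypothetical helper: tokenise a NUL-terminated buffer of 0/1 values.
 * Stores at most max_out values and returns how many were stored. */
size_t parse_bits(char *buf, int *out, size_t max_out)
{
    size_t n = 0;
    for (char *tok = strtok(buf, ",\n"); tok != NULL && n < max_out;
         tok = strtok(NULL, ",\n"))
        out[n++] = tok[0] - '0';
    return n;
}
```

With the terminator in place, strtok() stops inside the allocation, so the two invalid reads should go away. This still needs the full 480 MB buffer on top of the table itself.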
You don't need to buffer the entire file; for simple cases like this, you can step through it using a one-character buffer. This will consume less memory and will also be considerably faster (at least 2 times):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(){
unsigned long no_of_houses = 1048576L; //dimensions of my final table.
unsigned int no_of_appliances = 5;
unsigned int no_of_sectors = 48;
unsigned long no = 0;
int ch;
unsigned int *intended_schedule; // this is where the table will be stored.
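intended_schedule = malloc(no_of_houses * no_of_appliances * no_of_sectors * sizeof *intended_schedule);
if(!intended_schedule) { // this asks for roughly 1 GB, so check that it succeeded
perror("malloc");
exit(EXIT_FAILURE);
}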
FILE *fptr = fopen("./data/houses.csv", "r"); //this file is around 480 mega bytes.
if(!fptr) {
perror("houses file");
exit(EXIT_FAILURE); // report the failure to the caller instead of exiting with 0
}
while(no < no_of_houses*no_of_appliances*no_of_sectors) {
ch = getc(fptr);
if (ch== EOF) break;
if (ch== '\n') continue;
if (ch== ',') continue;
intended_schedule[no++] = ch - '0'; // it's either 0 or 1. So this is good enough
}
fclose(fptr);
free(intended_schedule);
return 0;
}
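The loop above trusts every byte that is not a comma or newline to be '0' or '1'. If the function is going to live inside a bigger program, a slightly more defensive variant might look like the sketch below. The name read_bits and its return convention are assumptions for illustration, not something from the original code:

```c
#include <stdio.h>

/* Hypothetical helper: stream 0/1 digits from fp into out[].
 * Stores at most max_out values; *count receives the number stored.
 * Returns 0 on success, or -1 if a byte other than '0', '1', ',',
 * '\n' or '\r' is encountered. */
int read_bits(FILE *fp, unsigned int *out, size_t max_out, size_t *count)
{
    size_t n = 0;
    int ch;
    int status = 0;

    while (n < max_out && (ch = getc(fp)) != EOF) {
        if (ch == ',' || ch == '\n' || ch == '\r')
            continue;               /* separators carry no data */
        if (ch != '0' && ch != '1') {
            status = -1;            /* unexpected byte: stop and report */
            break;
        }
        out[n++] = (unsigned int)(ch - '0');
    }
    *count = n;
    return status;
}
```

The caller can then compare the returned count against no_of_houses*no_of_appliances*no_of_sectors to detect a file that is shorter than expected.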