Search code examples
cfileiofgetsfile-processing

C : Best way to go to a known line of a file


I have a file in which I'd like to iterate without processing in any sort the current line. What I am looking for is the best way to go to a determined line of a text file. For example, storing the current line into a variable seems useless until I get to the pre-determined line.

Example :

file.txt

foo
fooo
fo
here

Normally, in order to get here, I would have done something like :

FILE* file = fopen("file.txt", "r");
if (file == NULL)
    perror("Error when opening file ");
char currentLine[100];
while(fgets(currentLine, 100, file))
{
    if(strstr(currentLine, "here") != NULL)
         return currentLine;
}

But fgetswill have to read fully three line uselessly and currentLine will have to store foo, fooo and fo.

Is there a better way to do this, knowing that here is line 4? Something like a go tobut for files?


Solution

  • You cannot access directly to a given line of a textual file (unless all lines have the same size in bytes; and with UTF8 everywhere a Unicode character can take a variable number of bytes, 1 to 6; and in most cases lines have various length - different from one line to the next). So you cannot use fseek (because you don't know in advance the file offset).

    However (at least on Linux systems), lines are ending with \n (the newline character). So you could read byte by byte and count them:

    int c= EOF;
    int linecount=1;
    while ((c=fgetc(file)) != EOF) {
      if (c=='\n')
        linecount++;
    }
    

    You then don't need to store the entire line.

    So you could reach the line #45 this way (using while ((c=fgetc(file)) != EOF) && linecount<45) ...) and only then read entire lines with fgets or better yet getline(3) on POSIX systems (see this example). Notice that the implementation of fgets or of getline is likely to be built above fgetc, or at least share some code with it. Remember that <stdio.h> is buffered I/O, see setvbuf(3) and related functions.


    Another way would be to read the file in two passes. A first pass stores the offset (using ftell(3)...) of every line start in some efficient data structure (a vector, an hashtable, a tree...). A second pass use that data structure to retrieve the offset (of the line start), then use fseek(3) (using that offset).


    A third way, POSIX specific, would be to memory-map the file using mmap(2) into your virtual address space (this works well for not too huge files, e.g. of less than a few gigabytes). With care (you might need to mmap an extra ending page, to ensure the data is zero-byte terminated) you would then be able to use strchr(3) with '\n'

    In some cases, you might consider parsing your textual file line by line (using appropriately fgets, or -on Linux- getline, or generating your parser with flex and bison) and storing each line in a relational database (such as PostGreSQL or sqlite).

    PS. BTW, the notion of lines (and the end-of-line mark) vary from one OS to the next. On Linux the end-of-line is a \n character. On Windows lines are rumored to end with \r\n, etc...