c, file-io, compression, bzip2

How do I extract all the data from a bzip2 archive with C?


I have a concatenated file made up of some number of bzip2 archives. I also know the sizes of the individual bzip2 chunks in that file.

I would like to decompress a bzip2 stream from an individual bzip2 data chunk, and write the output to standard output.

First I use fseek to move the file cursor to the first byte of the desired archive, and then read a chunk of the known size from the file with a BZ2_bzRead call:

int headerSize = 1234;
int firstChunkSize = 123456;
FILE *fp = fopen("pathToConcatenatedFile", "r+b");
char *bzBuf = malloc(sizeof(char) * firstChunkSize);
int bzError, bzNBuf;
BZFILE *bzFp = BZ2_bzReadOpen(&bzError, fp, 0, 0, NULL, 0);

/* move cursor past header of known size, to the first bzip2 "chunk" */
fseek(fp, headerSize, SEEK_SET);

while (bzError != BZ_STREAM_END) {
    /* read the first chunk of known size, decompress it */
    bzNBuf = BZ2_bzRead(&bzError, bzFp, bzBuf, firstChunkSize);
    fprintf(stdout, bzBuf);
}

BZ2_bzReadClose(&bzError, bzFp);
free(bzBuf);
fclose(fp);

The problem is that when I compare the output of the fprintf statement with output from running bzip2 on the command line, I get two different answers.

Specifically, I get less output from this code than from running bzip2 on the command line: my output is a subset of the command-line output, and what is missing is the tail end of the bzip2 chunk of interest.

I have verified through another technique that the command-line bzip2 is providing the correct answer, and, therefore, some problem with my C code is causing output at the end of the chunk to go missing. I just don't know what that problem is.

If you are familiar with bzip2 or libbzip2, can you provide any advice on what I am doing wrong in the code sample above? Thank you for your advice.


Solution

  • This is my source code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    #include <bzlib.h>
    
    int
    bunzip_one(FILE *f) {
      int bzError;
      BZFILE *bzf;
      char buf[4096];
    
      bzf = BZ2_bzReadOpen(&bzError, f, 0, 0, NULL, 0);
      if (bzError != BZ_OK) {
        fprintf(stderr, "E: BZ2_bzReadOpen: %d\n", bzError);
        return -1;
      }
    
      while (bzError == BZ_OK) {
        int nread = BZ2_bzRead(&bzError, bzf, buf, sizeof buf);
        if (bzError == BZ_OK || bzError == BZ_STREAM_END) {
          size_t nwritten = fwrite(buf, 1, nread, stdout);
          if (nwritten != (size_t) nread) {
            fprintf(stderr, "E: short write\n");
            return -1;
          }
        }
      }
    
      if (bzError != BZ_STREAM_END) {
        fprintf(stderr, "E: bzip error after read: %d\n", bzError);
        return -1;
      }
    
      BZ2_bzReadClose(&bzError, bzf);
      return 0;
    }
    
    int
    bunzip_many(const char *fname) {
      FILE *f;
    
      f = fopen(fname, "rb");
      if (f == NULL) {
        perror(fname);
        return -1;
      }
    
      fseek(f, 0, SEEK_SET);
      if (bunzip_one(f) == -1)
        return -1;
    
      fseek(f, 42, SEEK_SET); /* hello.bz2 is 42 bytes long in my case */
      if (bunzip_one(f) == -1)
        return -1;
    
      fclose(f);
      return 0;
    }
    
    int
    main(int argc, char **argv) {
      if (argc < 2) {
        fprintf(stderr, "usage: bunz <fname>\n");
        return EXIT_FAILURE;
      }
      return bunzip_many(argv[1]) != 0 ? EXIT_FAILURE : EXIT_SUCCESS;
    }
    
    • I cared very much about proper error checking. For example, I made sure that bzError was BZ_OK or BZ_STREAM_END before trying to access the buffer. The documentation clearly says that for other values of bzError the returned number is undefined.
    • It shouldn't frighten you that about 50 percent of the code is concerned with error handling. That's how it should be. Expect errors everywhere.
    • The code still has some bugs. In case of errors it doesn't release the resources (f, bzf) properly.

    And these are the commands I used for testing:

    $ echo hello > hello
    $ echo world > world
    $ bzip2 hello
    $ bzip2 world
    $ cat hello.bz2 world.bz2 > helloworld.bz2
    $ gcc -W -Wall -Os -o bunz bunz.c -lbz2
    $ ls -l *.bz2
    -rw-r--r-- 1 roland None 42 Oct 12 09:26 hello.bz2
    -rw-r--r-- 1 roland None 86 Oct 12 09:36 helloworld.bz2
    -rw-r--r-- 1 roland None 44 Oct 12 09:26 world.bz2
    $ ./bunz.exe helloworld.bz2 
    hello
    world
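One difference from the question's code is worth spelling out. The program above writes with fwrite and the byte count returned by BZ2_bzRead, whereas fprintf(stdout, bzBuf) treats the buffer as a NUL-terminated format string: it stops at the first zero byte, interprets any % sequences, and can read past the end of a buffer that was never terminated. Whether that alone explains the missing tail in the question is not established here. The following is a minimal sketch of the count-based write, with write_bytes as a hypothetical helper name:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: write exactly n bytes of buf to out, regardless
 * of embedded NUL bytes or '%' characters in the data.
 * Returns 0 on success, -1 on a short write. */
int
write_bytes(FILE *out, const char *buf, size_t n) {
  return fwrite(buf, 1, n, out) == n ? 0 : -1;
}
```

In the question's loop, this would correspond to passing bzNBuf as the count after checking that bzError is BZ_OK or BZ_STREAM_END.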