Search code examples
pdfhdf5

Is there a good way to store a lot of small PDF files?


The nature of my job requires me to make a lot of PDF images of data that I am analyzing. At the end of the day I only use maybe 10% of the images as a "proof" of concept but I still want to save all of the images in case people want to scrutinize my work.

I am thinking of something like, storing the PDF files in an hdf5 file but as far as I am aware this is not possible (my only interface with hdf5 files are through the h5py module in python).

Do you guys have any recommendation?


Solution

  • It is possible to store (many) PDF files within an HDF5 file. One way to solve this could be to create a dataset for each PDF, and have the dataset be of data type opaque of one dimension with a size equal to the size of the PDF file. If you are not bound to a specific technology, you could solve your use-case using HDFql as follows:

    # import HDFql package
    import HDFql
    
    # create an HDF5 file named 'pdf.h5' and use (i.e. open) it
    HDFql.execute("CREATE AND USE FILE pdf.h5")
    
    # get all files contained in root directory '/my_dir' which, for the sake
    # of this example, contains the PDFs to store in the HDF5 file
    HDFql.execute("SHOW FILE /my_dir/")
    
    i = 0
    while HDFql.cursor_next() == HDFql.SUCCESS:
    
        # get name of PDF file
        file_name = HDFql.cursor_get_char()
    
        # create dataset containing the content of the PDF file
        HDFql.execute("CREATE DATASET dset_%d VALUES FROM BINARY FILE \"/my_dir/%s\"" % (i, file_name))
    
        i += 1