
pdftk command called by Python does not work properly


My goal is to extract multiple PDF pages from a website that serves them through its own viewer and merge them into a single PDF file, preserving the original order. To do this, I save each extracted page to a temporary directory using the tempfile library:

def save_publication_page_to_tempfile(
    publication_page,
    page_number,
    directory
):
    temp_pdf = tempfile.NamedTemporaryFile(
        prefix=f'{page_number}_',
        suffix='.pdf',
        dir=directory,
        delete=False
    )
    temp_pdf.write(publication_page)

    return temp_pdf.name

After saving each of the extracted pages, the files are merged using the pdftk tool:

def merge_pdf_files(self, publication_metadata, output_filename):
    with tempfile.TemporaryDirectory() as temp_dir:
        for publication in publication_metadata:
            save_publication_page_to_tempfile(
                publication['content'],
                publication['page_number'],
                temp_dir
            )

        command = (
            f"pdftk $(ls {temp_dir}/* | sort -n -t _ -k 1) "
            f"cat output {os.path.join('/tmp', output_filename)}"
        )
        os.system(command)

    if os.path.exists(os.path.join('/tmp', output_filename)):
        return os.path.join('/tmp', output_filename)
    else:
        return None

However, the merged file does not follow the desired page order. I noticed that when I pause execution with pdb.set_trace() just before the merge command and then run the same command directly inside the temporary directory, the generated PDF is in the desired order:

pdftk $(ls * | sort -n -t _ -k 1) cat output result.pdf

I would like to know the possible reasons why the page order of the generated PDF differs between running the command from the Python script and running it in Bash directly inside the temporary directory that holds the PDF files.


Solution

  • The following change to save_publication_page_to_tempfile solved my problem:

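    import tempfile

    def save_publication_page_to_tempfile(
        publication_page,
        page_number,
        directory
    ):
        # Zero-pad the page number so that lexicographic file-name order
        # matches numeric page order (e.g. 000002_ sorts before 000010_).
        formatted_page_number = str(page_number).zfill(6)
        # The with block closes the file, so its contents are flushed
        # to disk before pdftk reads it.
        with tempfile.NamedTemporaryFile(
            prefix=f'{formatted_page_number}_',
            suffix='.pdf',
            dir=directory,
            delete=False
        ) as temp_pdf:
            temp_pdf.write(publication_page)

        return temp_pdf.name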
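
A likely explanation for the difference: inside the script, ls {temp_dir}/* prints full paths, so the first _-separated field that sort -n sees starts with / rather than a digit. sort -n then treats every key as 0 and falls back to comparing the whole lines as strings, which puts 10_… before 2_…. Run from inside the directory, ls * prints bare file names, so the numeric key works as intended. Zero-padding makes string order and numeric order coincide, which is why the change above helps. As an alternative, the ordering can be done in Python instead of the shell. The following is only a sketch: page_number_of, build_pdftk_command and the sample paths are hypothetical names introduced for illustration, not part of the original code.

```python
import os
import re


def page_number_of(path):
    """Return the integer prefix of the file name, e.g. 12 for '/tmp/x/12_ab.pdf'."""
    match = re.match(r"(\d+)_", os.path.basename(path))
    if match is None:
        raise ValueError(f"no page-number prefix in {path!r}")
    return int(match.group(1))


def build_pdftk_command(pdf_paths, output_path):
    """Build the pdftk argument list with the input files in page order."""
    ordered = sorted(pdf_paths, key=page_number_of)
    return ["pdftk", *ordered, "cat", "output", output_path]


# Hypothetical paths, shaped like the ones NamedTemporaryFile produces.
paths = ["/tmp/tmpdir/10_a.pdf", "/tmp/tmpdir/2_b.pdf", "/tmp/tmpdir/1_c.pdf"]

# Plain string order: the kind of ordering sort falls back to for full paths.
lexicographic = sorted(paths)

# Order by the integer prefix of the base name instead.
command = build_pdftk_command(paths, "/tmp/result.pdf")
```

The resulting list could then be passed to subprocess.run(command, check=True), which avoids the shell entirely and so does not depend on how ls and sort treat the paths.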