Tags: python, image, pdf, audio, video

How to extract images, video and audio from a PDF file using Python


I need a Python program that can extract videos, audio and images from a PDF. I have tried libraries such as PyPDF2 and Pillow, but I could not get even one of the three to work, let alone all of them.


Solution

  • @George Davis-Diver can you please let me have an example PDF with video?

    Sounds and videos are embedded in their own specific annotation types. Neither is a FileAttachment annotation, so the methods for file attachments cannot be used.

    For a sound annotation, you must use `annot.get_sound()`, which returns a dictionary in which one of the values is the binary sound stream.
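
As a rough illustration of the sound case, the sketch below collects the dictionaries returned by `annot.get_sound()` for every Sound annotation on a page and writes the raw streams to disk. The helper names `extract_sounds` and `save_sounds` are made up for this example, and it assumes that the annotation type name is `"Sound"` and that the bytes sit under the key `"stream"`; check both against your PyMuPDF version. The saved bytes are the raw sound data, not necessarily a playable file.

```python
from pathlib import Path


def extract_sounds(page):
    """Return the sound dictionaries of all Sound annotations on a page.

    `page` is expected to behave like a PyMuPDF page (hypothetical helper).
    """
    sounds = []
    for annot in page.annots():
        # annot.type is a pair (number, name); assumed name for audio: "Sound"
        if annot.type[1] == "Sound":
            sounds.append(annot.get_sound())
    return sounds


def save_sounds(sounds, out_dir, stem="sound"):
    """Write each raw sound stream to its own file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, sound in enumerate(sounds):
        path = out_dir / f"{stem}-{number}.raw"
        # assumption: the binary data is stored under the key "stream"
        path.write_bytes(sound["stream"])
        paths.append(path)
    return paths
```

With PyMuPDF installed, this could be called as `save_sounds(extract_sounds(doc[0]), "sounds")` after `doc = fitz.open("your.pdf")`. The other dictionary entries describe the encoding and would be needed to turn the raw stream into a playable file.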

    Images, on the other hand, can certainly be embedded as FileAttachment annotations, but this is unusual. Normally they are displayed on the page independently. List a page's images like this:

    import fitz
    from pprint import pprint

    doc = fitz.open("your.pdf")
    page = doc[0]  # first page - use 0-based page numbers
    pprint(page.get_images())
    # example output:
    # [(1114, 0, 1200, 1200, 8, 'DeviceRGB', '', 'Im1', 'FlateDecode')]
    # extract the image stored under xref 1114 (first item of the tuple):
    img = doc.extract_image(1114)
    

    This is a dictionary with image metadata and the binary image stream. Note that PDF stores an image's transparency data separately, which needs some additional care, but let us postpone that until it actually comes up.
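
To go from that dictionary to files on disk, something like the following sketch may help. `save_page_images` is a hypothetical helper, not part of PyMuPDF. It relies on the first item of each `get_images()` tuple being the xref, and on the `extract_image()` dictionary holding the file extension under `"ext"` and the bytes under `"image"`. Transparency (the separately stored mask) is deliberately ignored here.

```python
from pathlib import Path


def save_page_images(doc, page, out_dir):
    """Save every image listed for `page` and return the written paths.

    `doc` and `page` are expected to behave like PyMuPDF objects.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for item in page.get_images():
        xref = item[0]  # first tuple entry is the image's xref
        info = doc.extract_image(xref)
        if not info:  # nothing extractable under this xref
            continue
        # name the file after the xref, using the extension reported
        path = out_dir / f"img-{xref}.{info['ext']}"
        path.write_bytes(info["image"])
        saved.append(path)
    return saved
```

For the example above, `save_page_images(doc, doc[0], "images")` would be expected to write a single file named after xref 1114.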

    Extracting video from RichMedia annotations is currently possible only with PyMuPDF's low-level code. @George Davis-Diver, thanks for the example file! Here is code that extracts the video content:

    import sys
    import pathlib
    import fitz
    
    doc = fitz.open("vid.pdf")  # open PDF
    page = doc[0]   # load desired page (0-based)
    annot = page.first_annot  # access the desired annot (first one in example)
    if annot is None:  # page has no annotations at all
        sys.exit("page has no annotations")
    if annot.type[0] != fitz.PDF_ANNOT_RICH_MEDIA:
        print(f"Annotation type is {annot.type[1]}")
        print("Only support RichMedia currently")
        sys.exit()
    cont = doc.xref_get_key(annot.xref, "RichMediaContent/Assets/Names")
    if cont[0] != "array":  # should be a PDF array
        sys.exit("unexpected: RichMediaContent/Assets/Names is not an array")
    array = cont[1][1:-1]  # remove array delimiters
    
    # jump over the name / title: we will get it later
    if array[0] == "(":
        i = array.find(")")
    else:
        i = array.find(">")
    xref = array[i + 1 :]  # here is the xref of the actual video stream
    if not xref.endswith(" 0 R"):
        sys.exit("media contents array has more than one entry")
    
    xref = int(xref[:-4])  # xref of video stream file
    video_filename = doc.xref_get_key(xref, "F")[1]
    video_xref = doc.xref_get_key(xref, "EF/F")[1]
    video_xref = int(video_xref.split()[0])
    video_stream = doc.xref_stream_raw(video_xref)
    pathlib.Path(video_filename).write_bytes(video_stream)
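
The script above stops when the Names array holds more than one entry. As a possible extension, the sketch below splits such an array string into (name, xref) pairs with a regular expression. `parse_names_array` is a hypothetical helper. It assumes each entry is a literal `(...)` or hex `<...>` string followed by an indirect reference with generation 0, and it does not handle nested unescaped parentheses inside names.

```python
import re

# one entry: a PDF string, then an indirect reference "n 0 R"
_ENTRY = re.compile(
    r"(\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>)"  # literal (..) or hex <..> string
    r"\s*(\d+)\s+0\s+R"  # object number; generation assumed to be 0
)


def parse_names_array(array_text):
    """Return [(name_token, xref), ...] for a Names array given as text.

    The name token is returned as written in the PDF, delimiters included.
    """
    return [(name, int(xref)) for name, xref in _ENTRY.findall(array_text)]
```

Each returned xref could then go through the same `F` and `EF/F` lookups as in the script, one file per entry.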