How to extract images, video and audio from a PDF file using Python

I need a Python program that can extract video, audio, and images from a PDF. I have tried libraries such as PyPDF2 and Pillow, but I could not get even one of the three to work, let alone all of them.

Solution

@George Davis-Diver can you please let me have an example PDF with video?

Sounds and videos are embedded in their own specific annotation types. Neither is a FileAttachment annotation, so the methods for file attachments cannot be used.

For a sound annotation, you must use `annot.get_sound()`, which returns a dictionary where one of the keys is the binary sound stream.
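As a rough sketch of how that could look in practice: the helper below walks a page's annotations and writes the stream of each sound annotation to a file. The function name `save_sound_annotations` and the `sound-<xref>.bin` file naming are made up for illustration. It also assumes that the type name reported in `annot.type[1]` is `"Sound"` and that the stream sits under the `"stream"` key, which matches my reading of the PyMuPDF documentation but is worth checking against your installed version.

```python
from pathlib import Path


def save_sound_annotations(page, out_dir="."):
    """Write the stream of every Sound annotation on `page` to a file.

    `page` is expected to be a PyMuPDF page. Returns a list of
    (path, metadata) tuples, where metadata is the get_sound()
    dictionary without the binary stream.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for annot in page.annots():
        # assumption: sound annotations report the type name "Sound"
        if annot.type[1] != "Sound":
            continue
        sound = annot.get_sound()
        # hypothetical naming scheme; the data is written unchanged
        target = out_dir / f"sound-{annot.xref}.bin"
        target.write_bytes(sound["stream"])  # assumption: key is "stream"
        metadata = {k: v for k, v in sound.items() if k != "stream"}
        saved.append((target, metadata))
    return saved


# Possible usage (requires PyMuPDF and a PDF that has sound annotations):
#   import fitz
#   doc = fitz.open("your.pdf")
#   print(save_sound_annotations(doc[0], "sounds"))
```

The stream is saved as-is. Turning it into a playable file would depend on the other entries of the dictionary (sample rate, channels, encoding), which is why they are returned alongside the path.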

Images, on the other hand, can certainly be embedded as FileAttachment annotations, but this is unusual. Normally they are displayed on the page independently. Find out a page's images like this:

import fitz  # PyMuPDF
from pprint import pprint

doc = fitz.open("your.pdf")
page = doc[0]  # first page - use 0-based page numbers
pprint(page.get_images())
# example output:
# [(1114, 0, 1200, 1200, 8, 'DeviceRGB', '', 'Im1', 'FlateDecode')]

# extract the image stored under xref 1114:
img = doc.extract_image(1114)

This is a dictionary with image metadata and the binary image stream. Note that PDF stores the transparency data of an image separately, which therefore needs some additional care, but let us postpone this until it actually comes up.
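To go from that dictionary to files on disk, something along these lines should work. It is only a sketch: the helper name `save_page_images` and the `img-<xref>.<ext>` naming are my own, and it relies on the dictionary containing the keys `"ext"` (file extension) and `"image"` (the binary data), as I understand the PyMuPDF documentation. Transparency masks are ignored here.

```python
from pathlib import Path


def save_page_images(doc, page, out_dir="."):
    """Save every image referenced by `page` and return the written paths.

    `doc` and `page` are expected to be a PyMuPDF document and one of
    its pages. Any separately stored transparency data is NOT applied.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for item in page.get_images():
        xref = item[0]  # first tuple entry is the image's xref
        info = doc.extract_image(xref)
        if not info:  # nothing extractable under this xref
            continue
        # hypothetical naming scheme: img-<xref>.<extension>
        target = out_dir / f"img-{xref}.{info['ext']}"
        target.write_bytes(info["image"])
        saved.append(target)
    return saved


# Possible usage (requires PyMuPDF):
#   import fitz
#   doc = fitz.open("your.pdf")
#   for page in doc:
#       print(save_page_images(doc, page, "images"))
```

If the same image is used on several pages it has the same xref everywhere, so with this naming scheme it is simply written again to the same file.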

Extracting video from RichMedia annotations is currently possible only with PyMuPDF's low-level code. @George Davis-Diver - thanks for the example file! Here is code that extracts video content:

import sys
import pathlib
import fitz

doc = fitz.open("vid.pdf")  # open PDF
page = doc[0]   # load desired page (0-based)
annot = page.first_annot  # access the desired annot (first one in example)
if annot.type[0] != fitz.PDF_ANNOT_RICH_MEDIA:
    print(f"Annotation type is {annot.type[1]}")
    print("Only support RichMedia currently")
    sys.exit()
cont = doc.xref_get_key(annot.xref, "RichMediaContent/Assets/Names")
if cont[0] != "array":  # should be PDF array
    sys.exit("unexpected: RichMediaContent/Assets/Names is no array")
array = cont[1][1:-1]  # remove array delimiters

# jump over the name / title: we will get it later
if array[0] == "(":
    i = array.find(")")
else:
    i = array.find(">")
xref = array[i + 1 :]  # here is the xref of the actual video stream
if not xref.endswith(" 0 R"):
    sys.exit("media contents array has more than one entry")

xref = int(xref[:-4])  # xref of video stream file
video_filename = doc.xref_get_key(xref, "F")[1]
video_xref = doc.xref_get_key(xref, "EF/F")[1]
video_xref = int(video_xref.split()[0])
video_stream = doc.xref_stream_raw(video_xref)
pathlib.Path(video_filename).write_bytes(video_stream)
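
The code above stops when the `Names` array has more than one entry. If you run into such a file, one possible way to handle it is to split the array text into all of its (name, xref) pairs first. The following is only a sketch under assumptions: `parse_names_array` is a made-up helper, it expects the text in the form returned by `doc.xref_get_key(...)[1]`, and the regular expression handles simple `(literal)` and `<hex>` names only (no nested, unescaped parentheses inside a name).

```python
import re

# One array entry: a name string followed by an indirect reference.
_ENTRY = re.compile(
    r"(\((?:\\.|[^\\)])*\)|<[0-9A-Fa-f\s]*>)"  # name: (literal) or <hex>
    r"\s*(\d+)\s+\d+\s+R"                      # reference: "xref gen R"
)


def parse_names_array(array_text):
    """Return a list of (raw_name, xref) pairs found in the array text.

    The name is returned exactly as written in the PDF, including its
    delimiters; the xref is returned as an int.
    """
    return [(m.group(1), int(m.group(2))) for m in _ENTRY.finditer(array_text)]


# Possible usage with the variables from the code above:
#   for raw_name, xref in parse_names_array(cont[1]):
#       ...look up "F" and "EF/F" for this xref as shown above...
print(parse_names_array("[(video.mp4)12 0 R]"))
```

Each returned xref can then go through the same `F` and `EF/F` lookups as in the single-entry case.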