python python-3.x pypdf pdfminer pdf-extraction

How to check if PDF is scanned image or contains text

I have a large number of PDF files. Some are scanned images saved as PDF, and others are full or partial text PDFs.

Is there a way to check these files so that we process only the ones that are scanned images and skip the full/partial text PDFs?

Environment: Python 3.6

Solution

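The code below uses PyMuPDF to extract the text layer from a PDF, whether it is fully or only partially searchable. For a purely scanned PDF with no text layer, the extracted text comes back empty, which is how you can tell the two kinds apart.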

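import fitz  # provided by the PyMuPDF package

text = ""
path = "Your_scanned_or_partial_scanned.pdf"

# Concatenate the text layer of every page
doc = fitz.open(path)
for page in doc:
    text += page.get_text()

# An empty result means no page had extractable text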

You can refer to this link for more information.

If the fitz module is not installed, install PyMuPDF, which provides it:

pip install --upgrade pymupdf
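
To answer the scanned-vs-text question directly, the extracted text can be checked page by page. The following is only a minimal sketch: the function names classify_page_texts and classify_pdf and the min_chars threshold of 20 are my own assumptions, not part of PyMuPDF, and the threshold may need tuning for your files. Note that a scanned PDF that has already been run through OCR carries a text layer and would be reported as "text".

```python
def classify_page_texts(page_texts, min_chars=20):
    """Classify a document from the text extracted from each of its pages.

    Returns "scanned" if no page has a meaningful text layer, "text" if
    every page has one, and "mixed" otherwise. A document with zero
    pages is reported as "scanned". min_chars is an assumed threshold.
    """
    # A page counts as textual when it holds at least min_chars
    # non-whitespace characters.
    has_text = [len("".join(t.split())) >= min_chars for t in page_texts]
    if not any(has_text):
        return "scanned"
    if all(has_text):
        return "text"
    return "mixed"


def classify_pdf(path, min_chars=20):
    """Open a PDF with PyMuPDF and classify it (hypothetical helper)."""
    import fitz  # imported here so the helper above works without PyMuPDF

    with fitz.open(path) as doc:
        return classify_page_texts([page.get_text() for page in doc], min_chars)
```

With this, you could keep only the files for which classify_pdf(path) returns "scanned", and decide separately how to handle the "mixed" ones.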