Regarding Splitting PDF and OCR Recognition

I have a lot of PDF documents which are scanned versions of writings. I need to split a single page inside a PDF into sections.

For example, if there is one page, I need to split that page into a header section, a footer section, a main body and side sections.

Which programming language and library give me the most flexibility for such a task without me doing all the grunt work? I'm familiar with Python. I know about Python's PDF and OCR libraries, but I couldn't find anything about splitting a single page.

Finally, I would like to pass the split sections of the PDF page to OCR to recognize the characters, and write the output to a CSV or text file.

Thanking you in advance....

Solution

To split the pages in a fairly simple way, I would suggest using pdfplumber. It is a powerful and well-documented tool for extracting text, tables and images from PDFs. Moreover, it has a very convenient method called crop that lets you crop a page and extract just the portion you need.

Just as an example, the code would be something like this (note that this will work with any number of pages):

import pdfplumber

filename = 'path/to/your/PDF'
# Fractions of the page size (0.0 to 1.0): [x0, top, x1, bottom]; example values
crop_coords = [0.1, 0.1, 0.9, 0.9]
text = ''
pages = []
with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        my_width = float(page.width)
        my_height = float(page.height)
        # Scale the fractions to absolute page coordinates
        my_bbox = (crop_coords[0] * my_width, crop_coords[1] * my_height,
                   crop_coords[2] * my_width, crop_coords[3] * my_height)
        page_crop = page.crop(bbox=my_bbox)
        # extract_text() may return None or '' when the crop has no text layer
        text += (page_crop.extract_text() or '').lower()
        pages.append(page_crop)

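Here is what the coordinates mean. Each value is a fraction (0.0 to 1.0) of the page width or height, measured from the top-left corner of the page:

x0 = Distance from the left side of the page to the left vertical cut, as a fraction of the page width.
top = Distance from the upper side of the page to the upper horizontal cut, as a fraction of the page height.
x1 = Distance from the left side of the page to the right vertical cut, as a fraction of the page width.
bottom = Distance from the upper side of the page to the lower horizontal cut, as a fraction of the page height.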
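
Since the documents in the question are scans, extract_text() will probably find little or nothing, because it reads the PDF's text layer and does not do OCR. As a rough sketch of the remaining steps, the cropped regions could be rendered to images and passed to Tesseract through pytesseract, then written to a CSV file. The region fractions in REGIONS and the helper names (to_bbox, ocr_pdf, write_csv) are made up for illustration, and the sketch assumes that pytesseract and the Tesseract binary are installed and that the page's coordinates start at (0, 0):

```python
import csv

# Hypothetical layout: each region is (x0, top, x1, bottom) as fractions of the
# page size, measured from the top-left corner. Tune these to your own scans.
REGIONS = {
    "header": (0.0, 0.0, 1.0, 0.1),
    "left_side": (0.0, 0.1, 0.15, 0.9),
    "body": (0.15, 0.1, 0.85, 0.9),
    "right_side": (0.85, 0.1, 1.0, 0.9),
    "footer": (0.0, 0.9, 1.0, 1.0),
}


def to_bbox(fractions, width, height):
    """Scale a fractional box to absolute page coordinates."""
    x0, top, x1, bottom = fractions
    return (x0 * width, top * height, x1 * width, bottom * height)


def ocr_pdf(filename, regions=REGIONS, resolution=300):
    """Yield (page_number, region_name, text) for every region of every page."""
    # Imported here so the helpers above stay usable without these packages.
    import pdfplumber
    import pytesseract

    with pdfplumber.open(filename) as pdf:
        for page in pdf.pages:
            width, height = float(page.width), float(page.height)
            for name, fractions in regions.items():
                crop = page.crop(to_bbox(fractions, width, height))
                # Render the cropped area to a PIL image and run Tesseract on it.
                image = crop.to_image(resolution=resolution).original
                yield page.page_number, name, pytesseract.image_to_string(image)


def write_csv(rows, out_path):
    """Write (page, region, text) rows to a CSV file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["page", "region", "text"])
        writer.writerows(rows)
```

It could then be used as write_csv(ocr_pdf('path/to/your/PDF'), 'sections.csv'). A higher resolution usually helps OCR on small print, at the cost of speed.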