
How to extract mixed fractions from a PDF using Python?


I am new to Python. I am trying to extract mixed fractions from a PDF file, but I have no idea which tool to use. My sample PDF contains only one page with simple text, and I would like to extract the part name and the length of each part. A screenshot of the sample PDF page is shown in the image link (Page 1 of Pdf - Screenshot). The PDF file can be downloaded from the following link (Sample Pdf).

EDIT 1 (updated):

Thank you for suggesting pdfplumber. It is a great tool, and I could extract the information with it. In some cases, though, when I extract the length, the whole number comes out fused with the denominator. For example, if the length is 36 1/2 (as shown in the screenshot), I get the value as 362 inches.

import pdfplumber
with pdfplumber.open("Sample.pdf") as pdf:
    first_page = pdf.pages[0]
    text = first_page.extract_text()
    for row in text.split('\n'):
        if 'inches' in row:
            num = row.split()[0]
            print(num)

Output: 362

This code works for me in most cases. In some cases, though, I get 362 as my output instead of 36 as a separate value. How can I resolve this issue?
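
One possible way to approach the fused digits, sketched here under assumptions that have not been checked against the sample PDF: pdfplumber exposes per-character dictionaries through `page.chars`, with keys such as `text`, `size`, `x0`, and `top`. If the fraction digits are set in a smaller font than the whole number, and the numerator is drawn above the denominator, the three parts can be separated by size and vertical position. The helper name `mixed_fraction_from_chars` and the 0.9 size threshold are hypothetical, and the caller is assumed to pass only the characters that belong to one length value.

```python
from fractions import Fraction


def mixed_fraction_from_chars(chars):
    """Rebuild a mixed fraction from pdfplumber-style char dicts.

    Assumptions (not confirmed for any particular PDF): whole-number digits
    use the largest font size, fraction digits are smaller, and the
    numerator sits above the denominator (smaller "top" value).
    """
    digits = [c for c in chars if c["text"].isdigit()]
    if not digits:
        raise ValueError("no digits found")

    full_size = max(c["size"] for c in digits)
    # Digits at (nearly) the largest size belong to the whole number.
    whole = [c for c in digits if c["size"] >= 0.9 * full_size]
    small = [c for c in digits if c["size"] < 0.9 * full_size]

    def join_left_to_right(group):
        ordered = sorted(group, key=lambda c: c["x0"])
        return int("".join(c["text"] for c in ordered))

    whole_value = join_left_to_right(whole)
    if not small:
        return Fraction(whole_value)

    # Split the small digits into upper (numerator) and lower (denominator)
    # halves around the midpoint of their vertical positions.
    tops = [c["top"] for c in small]
    mid = (min(tops) + max(tops)) / 2
    upper = [c for c in small if c["top"] <= mid]
    lower = [c for c in small if c["top"] > mid]
    if not upper or not lower:
        raise ValueError("could not separate numerator from denominator")

    return whole_value + Fraction(join_left_to_right(upper), join_left_to_right(lower))


# Synthetic characters standing in for "36 1/2"; real ones would come from page.chars.
sample_chars = [
    {"text": "3", "size": 12.0, "x0": 10.0, "top": 100.0},
    {"text": "6", "size": 12.0, "x0": 17.0, "top": 100.0},
    {"text": "1", "size": 7.0, "x0": 25.0, "top": 98.0},
    {"text": "2", "size": 7.0, "x0": 32.0, "top": 104.0},
]
length = mixed_fraction_from_chars(sample_chars)
print(length, float(length))  # 73/2 36.5
```

This does not help when the whole number and the fraction share one font size on one baseline; in that case every digit is treated as part of the whole number, so the size assumption should be verified by inspecting `page.chars` first.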


Solution

  • I would suggest using pdfplumber. It is a very powerful and well-documented tool for extracting text, tables, and images from PDFs. Moreover, it has a very convenient function called crop, which lets you crop the page and extract just the portion you need.

    Just as an example, the code would be something like this (note that this will work with any number of pages):

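    import pdfplumber

    filename = 'path/to/your/PDF'
    # Crop box as fractions of the page size (0.0 to 1.0), in the order
    # [x0, top, x1, bottom]. Placeholder values covering the whole page;
    # replace them with your own.
    crop_coords = [0.0, 0.0, 1.0, 1.0]
    text = ''
    pages = []
    with pdfplumber.open(filename) as pdf:
        for page in pdf.pages:
            my_width = float(page.width)
            my_height = float(page.height)
            # Convert the fractional coordinates to absolute page units
            my_bbox = (
                crop_coords[0] * my_width,
                crop_coords[1] * my_height,
                crop_coords[2] * my_width,
                crop_coords[3] * my_height,
            )
            page_crop = page.crop(bbox=my_bbox)
            # Guard against extract_text() returning None for an empty crop
            text += (page_crop.extract_text() or '').lower()
            pages.append(page_crop)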
    

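    Here is the explanation of the coordinates. Each one is a fraction (0 to 1) of the page width or height, measured from the top-left corner of the page:

    x0 = distance from the left vertical cut to the left side of the page, as a fraction of the page width.
    top = distance from the upper horizontal cut to the top of the page, as a fraction of the page height.
    x1 = distance from the right vertical cut to the left side of the page, as a fraction of the page width.
    bottom = distance from the lower horizontal cut to the top of the page, as a fraction of the page height.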
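
The conversion from fractional coordinates to an absolute bounding box can be isolated and checked without opening a PDF. The following is a minimal sketch, assuming all four values are fractions measured from the top-left corner of the page, as the multiplication by page width and height in the code above implies. The helper name `fractional_bbox` is hypothetical and not part of pdfplumber.

```python
def fractional_bbox(crop_coords, page_width, page_height):
    """Convert [x0, top, x1, bottom] fractions into absolute page units.

    Assumes every fraction is measured from the top-left corner of the page.
    """
    x0, top, x1, bottom = crop_coords
    # Reject boxes that are inverted, empty, or outside the page.
    if not (0 <= x0 < x1 <= 1 and 0 <= top < bottom <= 1):
        raise ValueError("expected 0 <= x0 < x1 <= 1 and 0 <= top < bottom <= 1")
    width = float(page_width)
    height = float(page_height)
    return (x0 * width, top * height, x1 * width, bottom * height)


# Left half of the middle band of a US Letter page (612 x 792 points).
print(fractional_bbox([0.0, 0.25, 0.5, 0.75], 612, 792))
# (0.0, 198.0, 306.0, 594.0)
```

With pdfplumber, the result could then be passed on as `page.crop(fractional_bbox(crop_coords, page.width, page.height))`.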