Tags: python, pdf, export-to-csv, mathematical-expressions, python-pdfreader

How to extract mathematical expressions from a PDF using Python?


I have a PDF that contains math equations (the example image is not included here).

I am trying to extract objective questions from a PDF file and convert them to a CSV file using Python, so that each row contains a question, its four options (one per column), and the correct option, for six columns in total. The PDF also contains mathematical equations, which I can't write to the CSV file as they are. Is it possible to write those equations to my CSV file as they appear in the PDF?


Solution

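  • This depends on how the formula is represented in the PDF. It can be an XObject, an inline image, or Unicode text. Unicode text can be written to a CSV file directly; a formula stored as an image has to be saved separately or converted to text first.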

    Try pdfreader. It can extract plain text, text together with its PDF commands, and images from PDF documents.

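    from pdfreader import SimplePDFViewer, PageDoesNotExist
    
    pdf_file_name = "questions.pdf"  # placeholder: path to your PDF file
    
    plain_text = ""    # decoded text strings only
    pdf_markdown = ""  # text together with the PDF text commands
    images = []        # inline images and image XObjects
    
    # Keep the file open while the viewer reads pages from it.
    with open(pdf_file_name, "rb") as fd:
        viewer = SimplePDFViewer(fd)
        try:
            while True:
                viewer.render()  # parse the current page onto viewer.canvas
                pdf_markdown += viewer.canvas.text_content
                plain_text += "".join(viewer.canvas.strings)
                images.extend(viewer.canvas.inline_images)
                images.extend(viewer.canvas.images.values())
                viewer.next()    # advance to the next page
        except PageDoesNotExist:
            # Raised by viewer.next() after the last page: we are done.
            pass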
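
If the formulas turn out to be Unicode text, the six-column CSV described in the question could be written roughly as follows. This is only a sketch: the sample row, the column names in `HEADER`, and the helpers `write_questions` and `save_questions` are made up for illustration, and splitting the extracted text into questions and options is not shown because it depends on the layout of your PDF.

```python
import csv
import io

# Hypothetical column names: question, four options, correct option.
HEADER = ["question", "option_a", "option_b", "option_c", "option_d", "correct"]

# Hypothetical sample row with Unicode math characters (², −).
rows = [
    ["Solve x² − 5x + 6 = 0", "x = 2, 3", "x = 1, 6", "x = −2, −3", "x = 0, 5", "A"],
]

def write_questions(rows, stream):
    """Write the header and the question rows to an open text stream."""
    writer = csv.writer(stream)  # quotes fields that contain commas
    writer.writerow(HEADER)
    writer.writerows(rows)

def save_questions(rows, path):
    """Write the rows to a CSV file on disk."""
    # "utf-8-sig" adds a byte order mark, which helps spreadsheet programs
    # such as Excel detect the encoding; plain "utf-8" also works.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        write_questions(rows, f)

# Preview the CSV text in memory without touching the disk.
preview = io.StringIO()
write_questions(rows, preview)
print(preview.getvalue())
```

Calling `save_questions(rows, "questions.csv")` would write the same content to a file. Formulas that are stored as images cannot be represented this way; for those, one option is to save each image to its own file and put the file name in the CSV cell instead.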