python-3.x pypdf

How to turn a pdf into a .docx file using python 3 and PyPDF2 (or any other way)?


I want to convert a .pdf into a .docx file. I have tried a few ways, but this is the one which seems best (correct me if I am wrong). I have seen this SO question, but it didn't work for me - it is the same as this:

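import PyPDF2

path = r"C:\Users\name\Desktop\test maker tester\Computer Science\414838-2020-specimen-paper-1.pdf"
text = ""
# Open the PDF in binary mode; the file is closed when the block exits.
with open(path, 'rb') as pdf_file:
    # PdfFileReader, numPages, getPage and extractText are the pre-3.0 PyPDF2 names.
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    c = read_pdf.numPages
    # Append the extracted text of every page to one string.
    for i in range(c):
        page = read_pdf.getPage(i)
        text += page.extractText()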

It does not give me an error, but I can't find any Word document, and the PDF is still there...

Do you know how to fix this, or can suggest any other way to turn a .pdf into a .docx file?


Solution

  • There is no direct way or package in Python that converts PDF to DOCX seamlessly. The method you tried only extracts the text of the PDF into a Python string; it never writes a .docx file, which is why no Word document appears. Even if you saved that text to a .docx, all of the document's formatting would be lost and you would get only plain, unstyled text.

    I have personally tried Adobe's Document Cloud SDK from Python, which converts PDF to DOCX while preserving the original formatting of the PDF document. It takes about 15 seconds per document. You can find more information on getting started at the links below:

    https://github.com/adobe/dc-view-sdk-samples

    https://www.adobe.io/apis/documentcloud/dcsdk/docs.html

    As for using this service from Python, you have to invoke its command-line commands with subprocess or os.system.

    Update:

    A detailed explanation of how to implement this method is available here: Link. Although it covers OCR conversion, the exact same process works for converting a PDF to DOCX.
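To make the plain-text point concrete, here is a minimal sketch of writing already-extracted text into an unstyled .docx using only the standard library. It assumes a bare-bones package of three XML parts is enough for Word to open the file. The function name write_plain_docx is made up for this illustration, and in practice a third-party library such as python-docx would probably be the more comfortable choice.

```python
import zipfile
from xml.sax.saxutils import escape

# Fixed parts of a bare-bones .docx package (a .docx file is a ZIP archive of XML parts).
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/'
    '2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def write_plain_docx(text, out_path):
    """Write `text` to `out_path` as an unstyled .docx, one paragraph per line.

    Hypothetical helper for illustration. It escapes &, < and > but does not
    strip control characters, which PDF text extraction can sometimes produce.
    """
    paragraphs = "".join(
        '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(line)
        for line in text.splitlines()
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, paragraphs)
    )
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as docx:
        docx.writestr("[Content_Types].xml", CONTENT_TYPES)
        docx.writestr("_rels/.rels", RELS)
        docx.writestr("word/document.xml", document)
```

Calling write_plain_docx(text, "output.docx") after the extraction loop from the question would produce a Word file, but it would contain only the raw text: no fonts, tables, images or layout from the PDF.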
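The subprocess approach mentioned in the answer could look roughly like the sketch below. The answer does not give the actual command line of the Adobe tooling, so the helper run_converter and the command in the demo are stand-ins; the real executable name and arguments would have to come from the SDK's own documentation.

```python
import subprocess
import sys


def run_converter(command):
    """Run an external command and return what it printed to standard output.

    `command` is a list of arguments, for example
    ["some-converter", "input.pdf", "output.docx"] (a hypothetical command,
    not the real Adobe CLI). check=True raises CalledProcessError if the
    command exits with a non-zero status.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere: ask the current Python
    # interpreter to print a message instead of calling a real converter.
    print(run_converter([sys.executable, "-c", "print('converted')"]))
```

Passing the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces, such as the one in the question. subprocess.run is generally preferred over os.system because it reports failures and captures output.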