php pdf ms-word libreoffice

LibreOffice converts PDF to Word as text boxes instead of a normal document


I want to convert a PDF to Microsoft Word (doc, docx) from the Ubuntu 18 terminal using LibreOffice 6.1.3.2 10(Build:2) (I actually run LibreOffice from PHP). But the result is a document full of text boxes instead of a normal Word document.

To understand my problem, I suggest downloading my files here: https://nofile.io/f/DKvQYFRdYZg/pdf2word.rar

I have 4 files:

1. original.doc
2. original-to-pdf.pdf
3. pdf-to-word.doc
4. expected.doc

First I converted original.doc to original-to-pdf.pdf, then I tried to convert it back to Word using the following command:

soffice --infilter="writer_pdf_import" --convert-to docx a.pdf

The file was created successfully, but all the content was converted to text boxes rather than normal document text. I then tried several online PDF-to-Word converters such as ilovepdf.com and got expected.doc.

You can see the difference by downloading my files from the link above, or in the images below.

My output:

[image]

ilovepdf output:

[image]

I tried several filters, including converting PDF to ODT and then ODT to Word, but none of the commands below gave me the expected result:

soffice --infilter="writer_pdf_import" --convert-to docx a.pdf
soffice --infilter="writer_pdf_import" --convert-to docx:"Microsoft Word 2007/2010/2013 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS 2003 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc a.pdf
soffice --infilter="writer_pdf_import" --convert-to odf:"writer8" a.pdf
soffice --infilter="writer8" --convert-to doc a.odf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 95" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 97" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"StarOffice XML (Writer)" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 2003 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to docx:"MS Word 2003 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 2007 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to docx:"MS Word 2007 XML Template" a.pdf
soffice --infilter="writer_pdf_import" --convert-to docx:"MS Word 2007 XML" a.pdf
soffice --infilter="Microsoft Word 2007/2010/2013 XML" --convert-to doc a.pdf
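
For reference, nearly every attempt above has the same shape: the writer_pdf_import input filter, an output extension with an optional output filter name, and the input file. Below is a minimal Python sketch of assembling such a command as an argument list when LibreOffice is driven from a script, so that filter names containing spaces need no shell quoting. The helper name build_soffice_command and its parameters are hypothetical. --headless and --outdir are standard soffice options added here as an assumption about a typical scripted setup. As far as I can tell, the output filter only controls how the already-imported document is written out, so varying it would not be expected to turn text boxes back into body text.

```python
import shlex
from typing import List, Optional


def build_soffice_command(
    input_path: str,
    out_ext: str,
    out_filter: Optional[str] = None,
    in_filter: str = "writer_pdf_import",
    outdir: str = ".",
) -> List[str]:
    """Assemble (but do not run) a soffice conversion command.

    Hypothetical helper for illustration; it only builds the argument list.
    """
    # --convert-to accepts "ext" or "ext:Output Filter Name".
    target = out_ext if out_filter is None else f"{out_ext}:{out_filter}"
    return [
        "soffice",
        "--headless",                # assumption: no GUI when scripted
        f"--infilter={in_filter}",
        "--convert-to", target,
        "--outdir", outdir,
        input_path,
    ]


if __name__ == "__main__":
    cmd = build_soffice_command("a.pdf", "docx", "MS Word 2007 XML")
    # Print a shell-quoted form for inspection.
    print(shlex.join(cmd))
    # To actually run it (requires LibreOffice to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the list to a process-spawning call without going through a shell avoids the nested quoting seen in the commands above.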

I know about premium software such as ABBYY Cloud or Adobe cloud, but I doubt a website like ilovepdf would use a paid service to provide a free one. My question is: am I missing something in my LibreOffice setup or its dependencies that is needed to convert a PDF into a normal Word document?


Solution

  • Your problem lies with the software used to create the PDF; output in the form of textboxes in a PDF is a characteristic of certain low-end PDF-creation software. There is nothing Word can do about that during the import process; you would need to clean it up afterwards.

    A Word macro you could use for the clean-up is:

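    Sub EraseTextBoxes()
    ' Replaces each text box with its contents, inserted at the box's anchor.
    Dim RngDoc As Range, RngShp As Range, i As Long
    With ActiveDocument
      ' Loop backwards so that deleting a shape does not shift the remaining indexes.
      For i = .Shapes.Count To 1 Step -1
        With .Shapes(i)
          If .Type = msoTextBox Then
            Set RngShp = .TextFrame.TextRange
            ' Exclude the text box's final paragraph mark.
            RngShp.End = RngShp.End - 1
            ' Insert the formatted text at the end of the anchor range.
            Set RngDoc = .Anchor
            RngDoc.Collapse wdCollapseEnd
            RngDoc.FormattedText = RngShp.FormattedText
            .Delete
          End If
        End With
      Next
    End With
    End Sub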
    

    Do note that whether the macro positions the output correctly depends on where the textboxes are anchored; if the anchor positions are unrelated to the textbox locations, you'll end up with a dog's breakfast. You'll probably still also end up with each line as its own paragraph. To clean up such content, see http://www.msofficeforums.com/word/29880-cleaning-up-text-pasted-websites-e-mails.html
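
    To illustrate the "each line as its own paragraph" problem, here is a minimal Python sketch of one possible clean-up heuristic. It is not the method from the linked page. It assumes the text has been exported as plain text with one line per paragraph, and it assumes that a line ending in sentence punctuation closes a paragraph. The function name merge_lines is hypothetical.

```python
from typing import List

# Assumption: a line ending in one of these characters closes a paragraph.
PARAGRAPH_ENDINGS = (".", "!", "?", ":")


def merge_lines(lines: List[str]) -> List[str]:
    """Join consecutive lines into paragraphs using a punctuation heuristic."""
    paragraphs: List[str] = []
    current: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line always closes the paragraph in progress.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(line)
        if line.endswith(PARAGRAPH_ENDINGS):
            paragraphs.append(" ".join(current))
            current = []
    # Flush any trailing lines that never reached closing punctuation.
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


if __name__ == "__main__":
    sample = [
        "The quick brown fox",
        "jumps over the lazy dog.",
        "A second",
        "paragraph here.",
    ]
    for paragraph in merge_lines(sample):
        print(paragraph)
```

    The heuristic is deliberately crude: a sentence that happens to end at the end of a line splits the paragraph there, and a heading without punctuation is merged into the text that follows it, so the result still needs a manual check.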