I want to convert a PDF to Microsoft Word (doc, docx) from the Ubuntu 18 terminal using LibreOffice 6.1.3.2 10(Build:2) (I actually run LibreOffice from PHP). However, the result is a document made up entirely of text boxes instead of a normal Word document.
To understand my problem, I suggest downloading my files here: https://nofile.io/f/DKvQYFRdYZg/pdf2word.rar
I have 4 files:
1.original.doc
2.original-to-pdf.pdf
3.pdf-to-word.doc
4.expected.doc
First I converted original.doc to original-to-pdf.pdf, then I tried to convert it back to Word using the following command:
soffice --infilter="writer_pdf_import" --convert-to docx a.pdf
The file was created successfully, but all the content was converted to text boxes rather than normal document text. I then tried several online PDF-to-Word converters such as ilovepdf.com and got expected.doc.
You can see the difference by downloading my files from the link above, or in the images below.
My output:
ilovepdf output:
I tried several filters, including converting PDF to ODT and then ODT to Word, but none of the commands below gave me the expected result:
soffice --infilter="writer_pdf_import" --convert-to docx a.pdf
soffice --infilter="writer_pdf_import" --convert-to docx:"Microsoft Word 2007/2010/2013 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS 2003 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc a.pdf
soffice --infilter="writer_pdf_import" --convert-to odf:"writer8" a.pdf
soffice --infilter="writer8" --convert-to doc a.odf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 95" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 97" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"StarOffice XML (Writer)" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 2003 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to docx:"MS Word 2003 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 2007 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to doc:"MS Word 2003 XML" a.pdf
soffice --infilter="writer_pdf_import" --convert-to docx:"MS Word 2007 XML Template" a.pdf
soffice --infilter="writer_pdf_import" --convert-to docx:"MS Word 2007 XML" a.pdf
soffice --infilter="Microsoft Word 2007/2010/2013 XML" --convert-to doc a.pdf
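Since the conversion is driven from PHP, it may help to wrap the first command in a small shell function that reports failures clearly. This is only a sketch: the function name pdf_to_docx and its arguments are hypothetical, and it does not change how the PDF import filter lays out the text, so the text boxes will still be there.

```shell
# Hypothetical helper: convert one PDF to DOCX with LibreOffice.
# Usage: pdf_to_docx INPUT.pdf [OUTPUT_DIR]
pdf_to_docx() {
    in_file=$1
    out_dir=${2:-.}

    # Fail early if LibreOffice is not on PATH or the input is missing.
    command -v soffice >/dev/null 2>&1 || { echo "soffice not found" >&2; return 127; }
    [ -f "$in_file" ] || { echo "no such file: $in_file" >&2; return 1; }

    # --headless: run without a GUI, as needed when called from PHP.
    soffice --headless --infilter="writer_pdf_import" \
        --convert-to docx --outdir "$out_dir" "$in_file" >/dev/null || return 1

    # Do not rely on the exit status alone; check that the file exists.
    base=$(basename "$in_file")
    out_file="$out_dir/${base%.*}.docx"
    [ -f "$out_file" ] || { echo "conversion produced no output" >&2; return 1; }

    # Print the path of the converted file for the caller.
    printf '%s\n' "$out_file"
}
```

Called as pdf_to_docx a.pdf /tmp/out, it prints the path of the converted file on success and returns a non-zero status otherwise, which the PHP side can check.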
I know about premium software such as ABBYY Cloud or Adobe Cloud, but I don't think a website like ilovepdf would use a paid service to provide a free one. My question is: am I missing a LibreOffice dependency that is needed to convert a PDF to a normal Word document?
Your problem lies with the software used to create the PDF; output in the form of textboxes in a PDF is a characteristic of certain low-end PDF-creation software. There is nothing Word can do about that during the import process; you would need to clean it up afterwards.
A Word macro you could use for the clean-up is:
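Sub EraseTextBoxes()
    ' Replace each text box with its contents, inserted at the box's anchor.
    Dim RngDoc As Range, RngShp As Range, i As Long
    With ActiveDocument
        ' Loop backwards because shapes are deleted inside the loop.
        For i = .Shapes.Count To 1 Step -1
            With .Shapes(i)
                If .Type = msoTextBox Then
                    Set RngShp = .TextFrame.TextRange
                    ' Exclude the text box's final paragraph mark.
                    RngShp.End = RngShp.End - 1
                    Set RngDoc = .Anchor
                    RngDoc.Collapse wdCollapseEnd
                    ' Copy the text, with its formatting, into the document body.
                    RngDoc.FormattedText = RngShp.FormattedText
                    .Delete
                End If
            End With
        Next
    End With
End Sub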
Do note that whether the macro positions the output correctly depends on where the textboxes are anchored; if the anchor positions are unrelated to the textbox locations, you'll end up with a dog's breakfast. You'll probably still also end up with each line as its own paragraph. To clean up such content, see http://www.msofficeforums.com/word/29880-cleaning-up-text-pasted-websites-e-mails.html
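To illustrate the "each line as its own paragraph" problem outside Word, here is a minimal sketch that works on a plain-text export rather than on the .docx itself. The function name join_lines is hypothetical, and it assumes that real paragraphs are separated by blank lines; if they are not, everything would be merged into one paragraph, so treat it as a starting point only.

```shell
# Hypothetical helper: read text on stdin, join consecutive non-blank
# lines into one paragraph, and keep blank lines as paragraph breaks.
join_lines() {
    awk '
        # Blank line: flush the paragraph collected so far.
        /^[[:space:]]*$/ {
            if (buf != "") {
                print buf
                print ""
                buf = ""
            }
            next
        }
        # Non-blank line: trim it and append it to the current paragraph.
        {
            gsub(/^[[:space:]]+|[[:space:]]+$/, "")
            buf = (buf == "" ? $0 : buf " " $0)
        }
        # End of input: flush the last paragraph, if any.
        END {
            if (buf != "") print buf
        }
    '
}
```

For example, join_lines < exported.txt > cleaned.txt, where exported.txt is whatever text file you saved the document as.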