The intention is to rewrite .docx files with a different font and font size while keeping run attributes such as bold, underline and italic. I'll then add some headers and graphics to the newly created target.docx files.
How can I reconstruct paragraphs from runs? Currently, each run gets its own separate line!
from docx import Document
from docx.shared import Pt

def main(filename):
    try:
        src_doc = Document(filename)
        trg_doc = Document()
        style = trg_doc.styles['Normal']
        font = style.font
        font.name = 'Times'
        font.size = Pt(11)
        for p_cnt in range(len(src_doc.paragraphs)):
            for r_cnt in range(len(src_doc.paragraphs[p_cnt].runs)):
                curr_run = src_doc.paragraphs[p_cnt].runs[r_cnt]
                print('Run: ', curr_run.text)
                paragraph = trg_doc.add_paragraph()
                if curr_run.bold:
                    paragraph.add_run(curr_run.text).bold = True
                elif curr_run.italic:
                    paragraph.add_run(curr_run.text).italic = True
                elif curr_run.underline:
                    paragraph.add_run(curr_run.text).underline = True
                else:
                    paragraph.add_run(curr_run.text)
        trg_doc.save('../Output/the_target.docx')
    except IOError:
        print('There was an error opening the file')

if __name__ == '__main__':
    main("../Input/Current_File.docx")
Input:
1.0 PURPOSE The purpose of this procedure is to ensure all feedback is logged, documented and any resulting complaints are received, evaluated, and reviewed in accordance with 21 CFR Part 820 and ISO 13485
Output:
PURPOSE The purpose of this procedure is to ensure
all feedback is logged,
documented and any resulting complaints are received,
evaluated, and reviewed
in accordance with 21 CFR P art 820
and ISO 13485 .
You're adding a new paragraph for each run. Your core loop needs to look more like this:
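# tgt_doc is the target Document (named trg_doc in the question)
for src_paragraph in src_doc.paragraphs:
    # one target paragraph per source paragraph, created outside the run loop
    tgt_paragraph = tgt_doc.add_paragraph()
    for src_run in src_paragraph.runs:
        print('Run: ', src_run.text)
        tgt_run = tgt_paragraph.add_run(src_run.text)
        # separate ifs (not elif) so a run can be bold, italic and underlined at once
        if src_run.bold:
            tgt_run.bold = True
        if src_run.italic:
            tgt_run.italic = True
        if src_run.underline:
            tgt_run.underline = True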
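For reference, here is a minimal sketch of how that loop might fit into a complete script. The function names `copy_paragraphs` and `convert` are made up for illustration, and the font settings are simply the ones from the question. It assigns the source run's `bold`, `italic` and `underline` values directly instead of testing them, which assumes you also want to carry over an explicit `False` or an inherited `None`, not only `True`.

```python
def copy_paragraphs(src_doc, tgt_doc):
    """Copy run text and basic formatting, one target paragraph per source paragraph."""
    for src_paragraph in src_doc.paragraphs:
        # One new paragraph per source paragraph, not per run.
        tgt_paragraph = tgt_doc.add_paragraph()
        for src_run in src_paragraph.runs:
            tgt_run = tgt_paragraph.add_run(src_run.text)
            # bold and italic are tri-state in python-docx (True, False, or
            # None for "inherit"); underline can additionally hold a
            # WD_UNDERLINE member. Direct assignment preserves all of these.
            tgt_run.bold = src_run.bold
            tgt_run.italic = src_run.italic
            tgt_run.underline = src_run.underline
    return tgt_doc


def convert(src_path, tgt_path):
    """Hypothetical wrapper: open src_path, copy its paragraphs, save to tgt_path."""
    # Imported here so copy_paragraphs can be exercised without python-docx installed.
    from docx import Document
    from docx.shared import Pt

    tgt_doc = Document()
    font = tgt_doc.styles['Normal'].font
    font.name = 'Times'
    font.size = Pt(11)
    copy_paragraphs(Document(src_path), tgt_doc)
    tgt_doc.save(tgt_path)
```

You would call it as `convert('../Input/Current_File.docx', '../Output/the_target.docx')`. Only the text of each run is copied, so tables and pictures will not come across, and automatic list numbering (which is probably why the "1.0" is missing from your output) is not part of any run's text.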