
Transcribe .docx files via python-docx to modify font and font size. Need to reconstruct paragraphs in target files


The intention is to transcribe .docx files so that they use a different font and font size while keeping run attributes such as bold, underline, and italic. I'll then add some headers and graphics to the newly created target .docx files.

How do I reconstruct paragraphs from runs? Currently, each run gets its own separate line!

from docx import Document
from docx.shared import Pt

def main(filename):
    try:
        src_doc = Document(filename)
        trg_doc = Document()

        style = trg_doc.styles['Normal']
        font = style.font
        font.name = 'Times'
        font.size = Pt(11)

        for p_cnt in range(len(src_doc.paragraphs)):
            for r_cnt in range(len(src_doc.paragraphs[p_cnt].runs)):
                curr_run = src_doc.paragraphs[p_cnt].runs[r_cnt]
                print('Run: ', curr_run.text)
                paragraph = trg_doc.add_paragraph()

                if curr_run.bold:
                    paragraph.add_run(curr_run.text).bold = True
                elif curr_run.italic:
                    paragraph.add_run(curr_run.text).italic = True
                elif curr_run.underline:
                    paragraph.add_run(curr_run.text).underline = True
                else:
                    paragraph.add_run(curr_run.text)

        trg_doc.save('../Output/the_target.docx')

    except IOError:
        print('There was an error opening the file')

if __name__ == '__main__':
    main("../Input/Current_File.docx")

Input:

1.0 PURPOSE The purpose of this procedure is to ensure all feedback is logged, documented and any resulting complaints are received, evaluated, and reviewed in accordance with 21 CFR Part 820 and ISO 13485

Output:

PURPOSE The purpose of this procedure is to ensure

all feedback is logged,

documented and any resulting complaints are received,

evaluated, and reviewed

in accordance with 21 CFR P art 820

and ISO 13485 .

Solution

  • You're adding a new paragraph for each run. Your core loop needs to look more like this:

    for src_paragraph in src_doc.paragraphs:
        # One target paragraph per source paragraph, created outside the run loop
        trg_paragraph = trg_doc.add_paragraph()
        for src_run in src_paragraph.runs:
            print('Run: ', src_run.text)
            trg_run = trg_paragraph.add_run(src_run.text)
            # Separate ifs (not elif) so a run can be bold, italic and underlined at once
            if src_run.bold:
                trg_run.bold = True
            if src_run.italic:
                trg_run.italic = True
            if src_run.underline:
                trg_run.underline = True
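
    As a possible refinement, here is a minimal sketch of the same loop wrapped in a function. The name copy_paragraphs is hypothetical, not part of python-docx. It assumes you want to carry over each attribute exactly as stored. In python-docx, bold, italic and underline are tri-state (True, False, or None for "inherit from the style"), so assigning the source value directly also preserves an explicit False, which the if-based version drops.

```python
def copy_paragraphs(src_doc, trg_doc):
    """Copy paragraphs and their runs from src_doc into trg_doc.

    Hypothetical helper: both arguments are expected to behave like
    python-docx Document objects (.paragraphs, .add_paragraph()).
    """
    for src_paragraph in src_doc.paragraphs:
        # One target paragraph per source paragraph
        trg_paragraph = trg_doc.add_paragraph()
        for src_run in src_paragraph.runs:
            trg_run = trg_paragraph.add_run(src_run.text)
            # Copy the tri-state values as-is: True, False, or None (inherit)
            trg_run.bold = src_run.bold
            trg_run.italic = src_run.italic
            trg_run.underline = src_run.underline
    return trg_doc

# Possible usage, with the paths from the question:
#   from docx import Document
#   trg_doc = copy_paragraphs(Document('../Input/Current_File.docx'), Document())
#   trg_doc.save('../Output/the_target.docx')
```

    Run-level font names and sizes are deliberately not copied here, so the text should pick up whatever you set on the target document's 'Normal' style.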